---
id: components
sidebar_label: Pipeline Components
title: Components
abstract: Components make up your NLU pipeline and work sequentially to process user input
  into structured output. There are components for entity extraction, for intent classification, response selection,
  pre-processing, and more.
---


## Language Models

The following components load pre-trained models that are needed if you want to use pre-trained
word vectors in your pipeline.


### MitieNLP


* **Short**

  MITIE initializer



* **Outputs**

  Nothing



* **Requires**

  Nothing



* **Description**

  Initializes MITIE structures. Every MITIE component relies on this component,
  so put it at the beginning
  of every pipeline that uses any MITIE components.



* **Configuration**

  The MITIE library needs a language model file that **must** be specified in
  the configuration:

  ```yaml-rasa
  pipeline:
  - name: "MitieNLP"
    # language model to load
    model: "data/total_word_feature_extractor.dat"
  ```

  For more information on where to get that file, head over to
  [installing MITIE](./installation/installing-rasa-open-source.mdx#dependencies-for-mitie).


  You can also pre-train your own word vectors from a language corpus using MITIE. To do so:

  1. Get a clean language corpus (a Wikipedia dump works) as a set of text files.

  2. Build and run [MITIE Wordrep Tool](https://github.com/mit-nlp/MITIE/tree/master/tools/wordrep) on your corpus.
     This can take several hours or days, depending on your dataset and your workstation.
     wordrep needs roughly 128GB of RAM to run. If your machine has less, try extending your swap.

  3. Set the path of your new `total_word_feature_extractor.dat` as the `model` parameter to the `MitieNLP` component in your
     [configuration](./model-configuration.mdx) file.

  For a full example of how to train MITIE word vectors, check out
  [用Rasa NLU构建自己的中文NLU系统](http://www.crownpku.com/2017/07/27/%E7%94%A8Rasa_NLU%E6%9E%84%E5%BB%BA%E8%87%AA%E5%B7%B1%E7%9A%84%E4%B8%AD%E6%96%87NLU%E7%B3%BB%E7%BB%9F.html)
  ("Building your own Chinese NLU system with Rasa NLU"),
  a blog post that goes through creating a MITIE model from a Chinese Wikipedia dump.



### SpacyNLP


* **Short**

  spaCy language initializer



* **Outputs**

  Nothing



* **Requires**

  Nothing



* **Description**

  Initializes spaCy structures. Every spaCy component relies on this component, so put it at the beginning
  of every pipeline that uses any spaCy components.



* **Configuration**

  You need to specify the language model to use. The name will be passed to `spacy.load(name)`.
  You can find more information on the available models in the [spaCy documentation](https://spacy.io/usage/models).

  ```yaml-rasa
  pipeline:
  - name: "SpacyNLP"
    # language model to load
    model: "en_core_web_md"

    # when retrieving word vectors, this will decide if the casing
    # of the word is relevant. E.g. `hello` and `Hello` will
    # retrieve the same vector, if set to `False`. For some
    # applications and models it makes sense to differentiate
    # between these two words, therefore setting this to `True`.
    case_sensitive: False
  ```

  For more information on how to download the spaCy models, head over to
  [installing spaCy](./installation/installing-rasa-open-source.mdx#dependencies-for-spacy).

  In addition to spaCy's pretrained language models, you can also use this component to
  attach spaCy models that you've trained yourself.

## Tokenizers

Tokenizers split text into tokens.
If you want to split intents into multiple labels, e.g. for predicting multiple intents or for
modeling hierarchical intent structure, use the following flags with any tokenizer:

* `intent_tokenization_flag` indicates whether to tokenize intent labels. The default is `False`; set it to `True`
  to tokenize intent labels.

* `intent_split_symbol` sets the delimiter string used to split the intent labels. The default is an underscore
  (`_`).
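
As a rough illustration of what these two flags control, the following Python sketch splits an intent label on the configured symbol. The function `split_intent_label` is hypothetical and not part of Rasa, and the `+` delimiter in the usage line is only an example value for `intent_split_symbol`.

```python
from typing import List


def split_intent_label(
    label: str, tokenization_flag: bool = False, split_symbol: str = "_"
) -> List[str]:
    """Mimic the effect of `intent_tokenization_flag` and `intent_split_symbol`."""
    if not tokenization_flag:
        # Without the flag, the whole label is treated as a single token.
        return [label]
    return label.split(split_symbol)


# Example usage with a multi-intent style label (hypothetical label name).
print(split_intent_label("feedback+positive", tokenization_flag=True, split_symbol="+"))
```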


### WhitespaceTokenizer


* **Short**

  Tokenizer using whitespace as a separator



* **Outputs**

  `tokens` for user messages, responses (if present), and intents (if specified)



* **Requires**

  Nothing



* **Description**

  Creates a token for every whitespace-separated character sequence.

  Any character not in: `a-zA-Z0-9_#@&` will be substituted with whitespace before
  splitting on whitespace if the character fulfills any of the following conditions:
  - the character follows a whitespace: `" !word"` &#8594; `"word"`
  - the character precedes a whitespace: `"word! "` &#8594; `"word"`
  - the character is at the beginning of the string: `"!word"` &#8594; `"word"`
  - the character is at the end of the string: `"word!"` &#8594; `"word"`

  Note that:
  - `"wo!rd"` &#8594; `"wo!rd"`

  In addition, any character not in: `a-zA-Z0-9_#@&.~:\/?[]()!$*+,;=-` will be
  substituted with whitespace before splitting on whitespace if the character is not
  between numbers:
  - `"twenty{one"` &#8594; `"twenty"`, `"one"` (`"{"` is not between numbers)
  - `"20{1"` &#8594; `"20{1"` (`"{"` *is* between numbers)

  Note that:
  - `"name@example.com"` &#8594; `"name@example.com"`
  - `"10,000.1"` &#8594; `"10,000.1"`
  - `"1 - 2"` &#8594; `"1"`, `"2"`
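
  The rules above can be approximated with a single regular expression. The following Python sketch is an
  illustration only, not Rasa's implementation: the names `whitespace_tokenize`, `EDGE_RULE`, and `INNER_RULE` are
  made up for this example, and treating consecutive special characters as one run is an assumption.

  ```python
  import re

  # Characters that are always kept (first rule).
  BASIC = r"a-zA-Z0-9_#@&"
  # Extended set that is also kept inside a token (second rule).
  # Assumption: both `\` and `/` are treated as allowed characters here.
  EXTENDED = BASIC + r".~:\\/?\[\]()!$*+,;=\-"

  # A run of other characters that follows whitespace / starts the string,
  # or that precedes whitespace / ends the string.
  EDGE_RULE = rf"(?<!\S)[^{BASIC}\s]+|[^{BASIC}\s]+(?!\S)"
  # A run of characters outside the extended set that is not between two digits.
  INNER_RULE = rf"(?<![0-9])[^{EXTENDED}\s]+|[^{EXTENDED}\s]+(?![0-9])"

  PATTERN = re.compile(f"{EDGE_RULE}|{INNER_RULE}")


  def whitespace_tokenize(text: str) -> list:
      """Replace the matched characters with whitespace, then split on whitespace."""
      return PATTERN.sub(" ", text).split()


  for example in ["word! ", "wo!rd", "twenty{one", "20{1", "name@example.com", "1 - 2"]:
      print(repr(example), whitespace_tokenize(example))
  ```

  Running the loop on the examples listed above is a quick way to compare the sketch against the documented behavior.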


* **Configuration**

  ```yaml-rasa
  pipeline:
  - name: "WhitespaceTokenizer"
    # Flag to check whether to split intents
    "intent_tokenization_flag": False
    # Symbol on which intent should be split
    "intent_split_symbol": "_"
    # Regular expression to detect tokens
    "token_pattern": null
  ```


### JiebaTokenizer


* **Short**

  Tokenizer using Jieba for Chinese language



* **Outputs**

  `tokens` for user messages, responses (if present), and intents (if specified)



* **Requires**

  Nothing



* **Description**

  Creates tokens using the Jieba tokenizer, which is designed specifically for Chinese.
  It only works for the Chinese language.

  :::note
  To use `JiebaTokenizer` you need to install Jieba with `pip3 install jieba`.

  :::



* **Configuration**

  Custom dictionary files are loaded automatically if you specify the path of the directory that contains them
  via `dictionary_path`.
  If `dictionary_path` is `None` (the default), no custom dictionary is used.

  ```yaml-rasa
  pipeline:
  - name: "JiebaTokenizer"
    dictionary_path: "path/to/custom/dictionary/dir"
    # Flag to check whether to split intents
    "intent_tokenization_flag": False
    # Symbol on which intent should be split
    "intent_split_symbol": "_"
    # Regular expression to detect tokens
    "token_pattern": null
  ```


### MitieTokenizer


* **Short**

  Tokenizer using MITIE



* **Outputs**

  `tokens` for user messages, responses (if present), and intents (if specified)



* **Requires**

  [MitieNLP](./components.mdx#mitienlp)



* **Description**

  Creates tokens using the MITIE tokenizer.



* **Configuration**

  ```yaml-rasa
  pipeline:
  - name: "MitieTokenizer"
    # Flag to check whether to split intents
    "intent_tokenization_flag": False
    # Symbol on which intent should be split
    "intent_split_symbol": "_"
    # Regular expression to detect tokens
    "token_pattern": None
  ```


### SpacyTokenizer


* **Short**

  Tokenizer using spaCy



* **Outputs**

  `tokens` for user messages, responses (if present), and intents (if specified)



* **Requires**

  [SpacyNLP](./components.mdx#spacynlp)



* **Description**

  Creates tokens using the spaCy tokenizer.



* **Configuration**

  ```yaml-rasa
  pipeline:
  - name: "SpacyTokenizer"
    # Flag to check whether to split intents
    "intent_tokenization_flag": False
    # Symbol on which intent should be split
    "intent_split_symbol": "_"
    # Regular expression to detect tokens
    "token_pattern": null
  ```


## Featurizers

Text featurizers are divided into two different categories: sparse featurizers and dense featurizers.
Sparse featurizers return feature vectors in which most values are zero.
As those feature vectors would normally take up a lot of memory, we store them as sparse features.
Sparse features only store the non-zero values and their positions in the vector.
This saves a lot of memory and allows us to train on larger datasets.
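
To make the idea of sparse storage concrete, here is a minimal Python sketch. It is only an illustration: a plain dictionary stands in for whatever sparse matrix type a real pipeline uses, and the helper names `to_sparse` and `to_dense` are hypothetical.

```python
from typing import Dict, List


def to_sparse(dense: List[float]) -> Dict[int, float]:
    """Keep only the non-zero values, keyed by their position in the vector."""
    return {index: value for index, value in enumerate(dense) if value != 0}


def to_dense(sparse: Dict[int, float], size: int) -> List[float]:
    """Rebuild the full vector, filling every missing position with zero."""
    return [sparse.get(index, 0.0) for index in range(size)]


vector = [0.0, 0.0, 3.0, 0.0, 1.0]
stored = to_sparse(vector)
print(stored)  # only two of the five entries are stored
print(to_dense(stored, len(vector)) == vector)
```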

All featurizers can return two different kinds of features: sequence features and sentence features.
The sequence features are a matrix of size `(number-of-tokens x feature-dimension)`.
The matrix contains a feature vector for every token in the sequence.
This allows us to train sequence models.
The sentence features are represented by a matrix of size `(1 x feature-dimension)`.
It contains the feature vector for the complete utterance.
The sentence features can be used in any bag-of-words model.
The corresponding classifier can therefore decide what kind of features to use.
Note: The `feature-dimension` for sequence and sentence features does not have to be the same.

### MitieFeaturizer


* **Short**

  Creates a vector representation of user message and response (if specified) using the MITIE featurizer.



* **Outputs**

  `dense_features` for user messages and responses



* **Requires**

  [MitieNLP](./components.mdx#mitienlp)



* **Type**

  Dense featurizer



* **Description**

  Creates features for entity extraction, intent classification, and response classification using the MITIE
  featurizer.

  :::note
  This featurizer is NOT used by the `MitieIntentClassifier` component, but it can be used by any component
  later in the pipeline that makes use of `dense_features`.

  :::



* **Configuration**

  The sentence vector, i.e. the vector of the complete utterance, can be calculated in two different ways, either via
  mean or via max pooling. You can specify the pooling method in your configuration file with the option `pooling`.
  The default pooling method is set to `mean`.

  ```yaml-rasa
  pipeline:
  - name: "MitieFeaturizer"
    # Specify what pooling operation should be used to calculate the vector of
    # the complete utterance. Available options: 'mean' and 'max'.
    "pooling": "mean"
  ```
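
  The difference between the two pooling options can be sketched in a few lines of Python. This is an
  illustration under simple assumptions, not the featurizer's actual code: the function name `sentence_vector`
  is hypothetical, and the token vectors are made-up numbers.

  ```python
  from typing import List


  def sentence_vector(
      sequence_features: List[List[float]], pooling: str = "mean"
  ) -> List[float]:
      """Pool a (number-of-tokens x feature-dimension) matrix into one vector."""
      columns = list(zip(*sequence_features))  # one tuple per feature dimension
      if pooling == "mean":
          return [sum(column) / len(column) for column in columns]
      if pooling == "max":
          return [max(column) for column in columns]
      raise ValueError(f"Unknown pooling operation: {pooling}")


  # Two tokens with two-dimensional (made-up) feature vectors.
  token_vectors = [[1.0, 4.0], [3.0, 2.0]]
  print(sentence_vector(token_vectors, "mean"))
  print(sentence_vector(token_vectors, "max"))
  ```

  Mean pooling averages each dimension over all tokens, while max pooling keeps the largest value per dimension.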


### SpacyFeaturizer


* **Short**

  Creates a vector representation of user message and response (if specified) using the spaCy featurizer.



* **Outputs**

  `dense_features` for user messages and responses



* **Requires**

  [SpacyNLP](./components.mdx#spacynlp)



* **Type**

  Dense featurizer



* **Description**

  Creates features for entity extraction, intent classification, and response classification using the spaCy
  featurizer.



* **Configuration**

  The sentence vector, i.e. the vector of the complete utterance, can be calculated in two different ways, either via
  mean or via max pooling. You can specify the pooling method in your configuration file with the option `pooling`.
  The default pooling method is set to `mean`.

  ```yaml-rasa
  pipeline:
  - name: "SpacyFeaturizer"
    # Specify what pooling operation should be used to calculate the vector of
    # the complete utterance. Available options: 'mean' and 'max'.
    "pooling": "mean"
  ```


### ConveRTFeaturizer


* **Short**

  Creates a vector representation of user message and response (if specified) using the
  [ConveRT](https://github.com/PolyAI-LDN/polyai-models) model.



* **Outputs**

  `dense_features` for user messages and responses



* **Type**

  Dense featurizer



* **Description**

  Creates features for entity extraction, intent classification, and response selection.
  It uses the [default signature](https://github.com/PolyAI-LDN/polyai-models#tfhub-signatures) to compute vector
  representations of input text.

  :::note
  Since the `ConveRT` model is trained only on an English corpus of conversations, this featurizer should only
  be used if your training data is in English.

  :::

  :::note
  This component cannot currently run on macOS on the M1 / M2 architecture.
  More information on this limitation is available [here](./installation/environment-set-up.mdx#m1--m2-apple-silicon-limitations).

  :::



* **Configuration**

  ```yaml-rasa
  pipeline:
  - name: "ConveRTFeaturizer"
    # Remote URL or local directory of the model files (required)
    "model_url": None
  ```

  :::caution
  Since the public URL of the ConveRT model was taken offline recently, it is now mandatory
  to set the parameter `model_url` to a community/self-hosted URL or path to a local directory containing model files.

  :::


### LanguageModelFeaturizer


* **Short**

  Creates a vector representation of user message and response (if specified) using a pre-trained language model.



* **Outputs**

  `dense_features` for user messages and responses


* **Type**

  Dense featurizer



* **Description**

  Creates features for entity extraction, intent classification, and response selection.
  Uses a pre-trained language model to compute vector representations of input text.

  :::note
  Make sure that you use a language model that was pre-trained on a corpus in the same language as your
  training data.

  :::



* **Configuration**

  Include a [Tokenizer](./components.mdx#tokenizers) component before this component.

  Specify which language model to load via the parameter `model_name`. See the table below for the
  currently supported language models. The weights to be loaded can be specified by the additional parameter
  `model_weights`. If left empty, the default model weights listed in the table are used.

  ```
  +----------------+--------------+-------------------------+
  | Language Model | Parameter    | Default value for       |
  |                | "model_name" | "model_weights"         |
  +----------------+--------------+-------------------------+
  | BERT           | bert         | rasa/LaBSE              |
  +----------------+--------------+-------------------------+
  | GPT            | gpt          | openai-gpt              |
  +----------------+--------------+-------------------------+
  | GPT-2          | gpt2         | gpt2                    |
  +----------------+--------------+-------------------------+
  | XLNet          | xlnet        | xlnet-base-cased        |
  +----------------+--------------+-------------------------+
  | DistilBERT     | distilbert   | distilbert-base-uncased |
  +----------------+--------------+-------------------------+
  | RoBERTa        | roberta      | roberta-base            |
  +----------------+--------------+-------------------------+
  | camemBERT      | camembert    | camembert-base          |
  +----------------+--------------+-------------------------+
  ```

  Apart from the default pretrained model weights, further models can be used from
  [HuggingFace models](https://huggingface.co/models) provided the following conditions are met (the mentioned
  files can be found in the "Files and versions" section of the model website):

  * The model architecture is one of the supported language models (check that the `model_type` in `config.json` is
    listed in the table's column `model_name`)
  * The model has pretrained TensorFlow weights (check that the file `tf_model.h5` exists)
  * The model uses the default tokenizer (`config.json` should not contain a custom `tokenizer_class` setting)
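
  If you have downloaded a model's files to a local directory, the three conditions above could be checked
  with a small script like the following sketch. The helper `unmet_conditions` and the directory layout are
  assumptions made for this example; the set of names is copied from the `model_name` column of the table above.

  ```python
  import json
  import tempfile
  from pathlib import Path
  from typing import List

  # Values of the "model_name" column in the table above.
  SUPPORTED_MODEL_NAMES = {
      "bert", "gpt", "gpt2", "xlnet", "distilbert", "roberta", "camembert",
  }


  def unmet_conditions(model_dir: str) -> List[str]:
      """Return the conditions that a locally downloaded model directory fails."""
      directory = Path(model_dir)
      config = json.loads((directory / "config.json").read_text())
      problems = []
      if config.get("model_type") not in SUPPORTED_MODEL_NAMES:
          problems.append("unsupported model_type")
      if not (directory / "tf_model.h5").exists():
          problems.append("missing tf_model.h5")
      if config.get("tokenizer_class") is not None:
          problems.append("custom tokenizer_class")
      return problems


  # Demo with a throwaway directory that only contains a config.json.
  with tempfile.TemporaryDirectory() as demo_dir:
      Path(demo_dir, "config.json").write_text(json.dumps({"model_type": "bert"}))
      print(unmet_conditions(demo_dir))  # the weights file is missing in this demo
  ```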

  :::note
  The `LaBSE` weights that are loaded as default for the `bert` architecture provide a multi-lingual model trained on
  112 languages (see our [tutorial](https://www.youtube.com/watch?v=7tAWk_Coj-s) and the original
  [paper](https://arxiv.org/pdf/2007.01852.pdf)). We strongly encourage using this as a baseline and testing your bot
  end-to-end before trying to optimize this component with other weights/architectures.
  :::


  The following configuration loads the language model BERT with `rasa/LaBSE` weights, which can be found
  [here](https://huggingface.co/rasa/LaBSE/tree/main):

  ```yaml-rasa
  pipeline:
    - name: LanguageModelFeaturizer
      # Name of the language model to use
      model_name: "bert"
      # Pre-Trained weights to be loaded
      model_weights: "rasa/LaBSE"

      # An optional path to a directory from which
      # to load pre-trained model weights.
      # If the requested model is not found in the
      # directory, it will be downloaded and
      # cached in this directory for future use.
      # The default value of `cache_dir` can be
      # set using the environment variable
      # `TRANSFORMERS_CACHE`, as per the
      # Transformers library.
      cache_dir: null
  ```

### RegexFeaturizer


* **Short**

  Creates a vector representation of user message using regular expressions.



* **Outputs**

  `sparse_features` for user messages and `tokens.pattern`



* **Requires**

  `tokens`



* **Type**

  Sparse featurizer



* **Description**

  Creates features for entity extraction and intent classification.
  During training the `RegexFeaturizer` creates a list of regular expressions defined in the training
  data format.
  For each regex, a feature is set that marks whether this expression was found in the user message or not.
  All features are later fed into an intent classifier / entity extractor to simplify classification (assuming
  the classifier has learned during the training phase that a set feature indicates a certain intent / entity).
  Regex features for entity extraction are currently only supported by the [CRFEntityExtractor](./components.mdx#crfentityextractor) and the
  [DIETClassifier](./components.mdx#dietclassifier) components!
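
  The basic idea of "one feature per regular expression" can be sketched as follows. This is a simplified
  illustration that only produces message-level features; the pattern names and the function `regex_features`
  are hypothetical and do not reflect the component's internals.

  ```python
  import re
  from typing import Dict, List

  # Hypothetical patterns, standing in for the regexes defined in the training data.
  PATTERNS: Dict[str, str] = {
      "zipcode": r"\b\d{5}\b",
      "greeting": r"\bhello\b",
  }


  def regex_features(text: str, patterns: Dict[str, str]) -> List[float]:
      """Return one feature per pattern: 1.0 if it matches the text, else 0.0."""
      return [1.0 if re.search(regex, text) else 0.0 for regex in patterns.values()]


  print(regex_features("my zip code is 12345", PATTERNS))
  ```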



* **Configuration**

  Make the featurizer case insensitive by adding the `case_sensitive: False` option, the default being
  `case_sensitive: True`.

  To correctly process languages such as Chinese that don't use whitespace for word separation,
  add the `use_word_boundaries: False` option. The default is `use_word_boundaries: True`.

  ```yaml-rasa
  pipeline:
  - name: "RegexFeaturizer"
    # Process text case-sensitively (the default)
    "case_sensitive": True
    # Match lookup table entries on word boundaries
    "use_word_boundaries": True
  ```

  **Configuring for incremental training**

  To ensure that `sparse_features` are of fixed size during
  [incremental training](./command-line-interface.mdx#incremental-training), the
  component should be configured to account for additional patterns that may be
  added to the training data in future. To do so, configure the `number_additional_patterns`
  parameter while training the base model from scratch:

  ```yaml-rasa {3}
  pipeline:
  - name: RegexFeaturizer
    number_additional_patterns: 10
  ```

  If not configured by the user, the component will use twice the number of
  patterns currently present in the training data (including lookup tables and regex patterns)
  as the default value for `number_additional_patterns`.
  This number is kept at a minimum of 10 in order to avoid running out of additional
  slots for new patterns too frequently during incremental training.
  Once the component runs out of additional pattern slots, the new patterns are dropped
  and not considered during featurization. At this point, it is advisable
  to retrain a new model from scratch.
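
  The default described above amounts to a simple calculation, sketched here in Python. The function names are
  hypothetical; only the "twice the current number, but at least 10" rule and the dropping of surplus
  patterns are taken from the text.

  ```python
  from typing import List


  def default_additional_patterns(number_of_patterns: int) -> int:
      """Twice the number of patterns in the training data, but never below 10."""
      return max(10, 2 * number_of_patterns)


  def patterns_that_fit(new_patterns: List[str], free_slots: int) -> List[str]:
      """Patterns beyond the remaining slots are dropped during featurization."""
      return new_patterns[:free_slots]


  print(default_additional_patterns(3))   # small data sets fall back to the minimum
  print(default_additional_patterns(20))
  ```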


### CountVectorsFeaturizer


* **Short**

  Creates a bag-of-words representation of user messages, intents, and responses.



* **Outputs**

  `sparse_features` for user messages, intents, and responses



* **Requires**

  `tokens`



* **Type**

  Sparse featurizer



* **Description**

  Creates features for intent classification and response selection.
  Creates a bag-of-words representation of the user message, intent, and response using
  [sklearn's CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
  All tokens which consist only of digits (e.g. 123 and 99 but not a123d) will be assigned to the same feature.
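
  A minimal sketch of this counting behavior, written with the Python standard library rather than
  scikit-learn, might look like this. The placeholder name `__NUMBER__` and the function `bag_of_words` are
  assumptions made for the illustration.

  ```python
  from collections import Counter
  from typing import List

  # Assumed name for the single feature shared by all digit-only tokens.
  NUMBER_PLACEHOLDER = "__NUMBER__"


  def bag_of_words(tokens: List[str]) -> Counter:
      """Count tokens, mapping every digit-only token to the same feature."""
      return Counter(
          NUMBER_PLACEHOLDER if token.isdigit() else token for token in tokens
      )


  print(bag_of_words(["book", "123", "tables", "99", "a123d"]))
  ```

  Here `123` and `99` end up in the same feature, while `a123d` keeps its own.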



* **Configuration**

  See [sklearn's CountVectorizer docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
  for a detailed description of the configuration parameters.

  This featurizer can be configured to use word or character n-grams, using the `analyzer` configuration parameter.
  By default `analyzer` is set to `word` so word token counts are used as features.
  If you want to use character n-grams, set `analyzer` to `char` or `char_wb`.
  The lower and upper boundaries of the n-grams can be configured via the parameters `min_ngram` and `max_ngram`.
  By default both of them are set to `1`.
  By default the featurizer takes the lemma of a word instead of the word directly if it is available. The lemma
  of a word is currently only set by the [SpacyTokenizer](./components.mdx#spacytokenizer). You can
  disable this behavior by setting `use_lemma` to `False`.

  :::note
  Option `char_wb` creates character n-grams only from text inside word boundaries;
  n-grams at the edges of words are padded with space.
  This option can be used to create [Subword Semantic Hashing](https://arxiv.org/abs/1810.07150).

  :::

  :::note
  For character n-grams do not forget to increase `min_ngram` and `max_ngram` parameters.
  Otherwise the vocabulary will contain only single letters.

  :::
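
  To see why larger n-gram boundaries matter for `char_wb`, consider this simplified Python sketch of
  word-boundary character n-grams. It is an approximation for illustration, not scikit-learn's code: the
  function `char_wb_ngrams` is hypothetical, and words shorter than `n` are simply skipped here.

  ```python
  from typing import List


  def char_wb_ngrams(text: str, min_ngram: int, max_ngram: int) -> List[str]:
      """Create character n-grams inside word boundaries, padding words with spaces."""
      ngrams = []
      for word in text.split():
          padded = f" {word} "
          for n in range(min_ngram, max_ngram + 1):
              for start in range(len(padded) - n + 1):
                  ngrams.append(padded[start:start + n])
      return ngrams


  print(char_wb_ngrams("hi there", 3, 3))
  print(char_wb_ngrams("hi", 1, 1))  # with n = 1 only single characters remain
  ```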

  Handling Out-Of-Vocabulary (OOV) words:

  :::note
  Enabled only if `analyzer` is `word`.

  :::

  Since training is performed on a limited vocabulary, there is no guarantee that the algorithm
  will not encounter an unknown word (a word that was not seen during training) during prediction.
  To teach the algorithm how to treat unknown words, some words in the training data can be substituted
  by the generic word `OOV_token`.
  During prediction, all unknown words are then treated as this generic word `OOV_token`.

  For example, you might create a separate intent `outofscope` in the training data containing messages with
  varying numbers of `OOV_token`s and maybe some additional general words.
  The algorithm will then likely classify a message with unknown words as this intent `outofscope`.

  You can either set the `OOV_token` or a list of words `OOV_words`:

  * `OOV_token` sets a keyword for unseen words. If the training data contains `OOV_token` as a word in some
    messages, words that were not seen during training are substituted with the provided `OOV_token`
    during prediction. If `OOV_token=None` (the default), words that were not seen during
    training are ignored at prediction time.

  * `OOV_words` sets a list of words to be treated as `OOV_token` during training. If you know a list of words
    that should be treated as out-of-vocabulary, you can set it as `OOV_words` instead of manually
    changing the training data or using a custom preprocessor.

  :::note
  This featurizer creates a bag-of-words representation by **counting** words,
  so the number of `OOV_token` in the sentence might be important.

  :::

  :::note
  Providing `OOV_words` is optional. The training data can also contain `OOV_token`, entered manually or by a
  custom additional preprocessor.
  Unseen words are substituted with `OOV_token` **only** if this token is present in the training
  data or an `OOV_words` list is provided.

  :::
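
  The substitution logic described in this section can be sketched in Python as follows. This is a simplified
  model under stated assumptions, not the featurizer's implementation: messages are split on whitespace, and
  the function names `build_vocabulary` and `prepare_for_prediction` are hypothetical.

  ```python
  from typing import Iterable, List, Optional, Set


  def build_vocabulary(
      messages: Iterable[str],
      oov_token: Optional[str] = None,
      oov_words: Iterable[str] = (),
  ) -> Set[str]:
      """Collect training words, replacing every word in `oov_words` by `oov_token`."""
      replace = set(oov_words)
      vocabulary = set()
      for message in messages:
          for word in message.split():
              vocabulary.add(oov_token if oov_token and word in replace else word)
      return vocabulary


  def prepare_for_prediction(
      message: str, vocabulary: Set[str], oov_token: Optional[str] = None
  ) -> List[str]:
      """Substitute unseen words with the token if it is in the vocabulary, else drop them."""
      prepared = []
      for word in message.split():
          if word in vocabulary:
              prepared.append(word)
          elif oov_token is not None and oov_token in vocabulary:
              prepared.append(oov_token)
          # Otherwise the unseen word is ignored.
      return prepared


  vocab = build_vocabulary(["book a table", "asdfgh please"], "_oov_", ["asdfgh"])
  print(prepare_for_prediction("book a qwerty", vocab, "_oov_"))
  ```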

  If you want to share the vocabulary between user messages and intents, you need to set the option
  `use_shared_vocab` to `True`. In that case, a common vocabulary set between tokens in intents and user messages
  is built.

  ```yaml-rasa
  pipeline:
  - name: "CountVectorsFeaturizer"
    # Analyzer to use, either 'word', 'char', or 'char_wb'
    "analyzer": "word"
    # Set the lower and upper boundaries for the n-grams
    "min_ngram": 1
    "max_ngram": 1
    # Set the out-of-vocabulary token
    "OOV_token": "_oov_"
    # Whether to use a shared vocab
    "use_shared_vocab": False
  ```

  **Configuring for incremental training**


  To ensure that `sparse_features` are of fixed size during
  [incremental training](./command-line-interface.mdx#incremental-training), the
  component should be configured to account for additional vocabulary tokens
  that may be added as part of new training examples in the future.
  To do so, configure the `additional_vocabulary_size` parameter while training the base model from scratch:

  ```yaml-rasa {3-6}
  pipeline:
  - name: CountVectorsFeaturizer
    additional_vocabulary_size:
      text: 1000
      response: 1000
      action_text: 1000
  ```

  As in the above example, you can define additional vocabulary size for each of
  `text` (user messages), `response` (bot responses used by `ResponseSelector`) and
  `action_text` (bot responses not used by `ResponseSelector`). If you are building a shared
  vocabulary (`use_shared_vocab=True`), you only need to define a value for the `text` attribute.
  If any of the attributes is not configured by the user, the component takes half of the current
  vocabulary size as the default value for the attribute's `additional_vocabulary_size`.
  This number is kept at a minimum of 1000 in order to avoid running out of additional vocabulary
  slots too frequently during incremental training. Once the component runs out of additional vocabulary slots,
  the new vocabulary tokens are dropped and not considered during featurization. At this point,
  it is advisable to retrain a new model from scratch.
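
  How the configured values and the defaults combine can be sketched as below. The function
  `resolve_additional_vocabulary_size` is hypothetical, and rounding "half" down with integer division is an
  assumption of this example.

  ```python
  from typing import Dict, Optional


  def resolve_additional_vocabulary_size(
      current_vocabulary_sizes: Dict[str, int],
      configured: Optional[Dict[str, int]] = None,
  ) -> Dict[str, int]:
      """Use the configured value per attribute, else half the vocabulary (minimum 1000)."""
      configured = configured or {}
      return {
          attribute: configured.get(attribute, max(1000, size // 2))
          for attribute, size in current_vocabulary_sizes.items()
      }


  # `text` is not configured here, so it falls back to the default rule.
  print(
      resolve_additional_vocabulary_size(
          {"text": 5000, "response": 800}, configured={"response": 200}
      )
  )
  ```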


The above configuration parameters are the ones you should configure to fit your model to your data.
However, additional parameters exist that can be adapted.

<details><summary>More configurable parameters</summary>

```
+---------------------------+-------------------------+--------------------------------------------------------------+
| Parameter                 | Default Value           | Description                                                  |
+===========================+=========================+==============================================================+
| use_shared_vocab          | False                   | If set to 'True' a common vocabulary is used for labels      |
|                           |                         | and user message.                                            |
+---------------------------+-------------------------+--------------------------------------------------------------+
| analyzer                  | word                    | Whether the features should be made of word n-gram or        |
|                           |                         | character n-grams. Option 'char_wb' creates character        |
|                           |                         | n-grams only from text inside word boundaries;               |
|                           |                         | n-grams at the edges of words are padded with space.         |
|                           |                         | Valid values: 'word', 'char', 'char_wb'.                     |
+---------------------------+-------------------------+--------------------------------------------------------------+
| strip_accents             | None                    | Remove accents during the pre-processing step.               |
|                           |                         | Valid values: 'ascii', 'unicode', 'None'.                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| stop_words                | None                    | A list of stop words to use.                                 |
|                           |                         | Valid values: 'english' (uses an internal list of            |
|                           |                         | English stop words), a list of custom stop words, or         |
|                           |                         | 'None'.                                                      |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_df                    | 1                       | When building the vocabulary ignore terms that have a        |
|                           |                         | document frequency strictly lower than the given threshold.  |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_df                    | 1                       | When building the vocabulary ignore terms that have a        |
|                           |                         | document frequency strictly higher than the given threshold  |
|                           |                         | (corpus-specific stop words).                                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_ngram                 | 1                       | The lower boundary of the range of n-values for different    |
|                           |                         | word n-grams or char n-grams to be extracted.                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_ngram                 | 1                       | The upper boundary of the range of n-values for different    |
|                           |                         | word n-grams or char n-grams to be extracted.                |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_features              | None                    | If not 'None', build a vocabulary that only considers the    |
|                           |                         | top max_features ordered by term frequency in the corpus.    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| lowercase                 | True                    | Convert all characters to lowercase before tokenizing.       |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_token                 | None                    | Keyword for unseen words.                                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_words                 | []                      | List of words to be treated as 'OOV_token' during training.  |
+---------------------------+-------------------------+--------------------------------------------------------------+
| alias                     | CountVectorFeaturizer   | Alias name of featurizer.                                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| use_lemma                 | True                    | Use the lemma of words for featurization.                    |
+---------------------------+-------------------------+--------------------------------------------------------------+
| additional_vocabulary_size| text: 1000              | Size of additional vocabulary to account for incremental     |
|                           | response: 1000          | training while training a model from scratch                 |
|                           | action_text: 1000       |                                                              |
+---------------------------+-------------------------+--------------------------------------------------------------+
```

</details>


### LexicalSyntacticFeaturizer


* **Short**

  Creates lexical and syntactic features for a user message to support entity extraction.



* **Outputs**

  `sparse_features` for user messages



* **Requires**

  `tokens`



* **Type**

  Sparse featurizer



* **Description**

  Creates features for entity extraction.
  Moves a sliding window over every token in the user message and creates features according to the
  configuration (see below). A default configuration is provided, so you don't need to specify one.



* **Configuration**

  You can configure what kind of lexical and syntactic features the featurizer should extract.
  The following features are available:

  ```
  ==============  ==========================================================================================
  Feature Name    Description
  ==============  ==========================================================================================
  BOS             Checks if the token is at the beginning of the sentence.
  EOS             Checks if the token is at the end of the sentence.
  low             Checks if the token is lower case.
  upper           Checks if the token is upper case.
  title           Checks if the token starts with an uppercase character and all remaining characters are
                  lowercased.
  digit           Checks if the token contains just digits.
  prefix5         Take the first five characters of the token.
  prefix2         Take the first two characters of the token.
  suffix5         Take the last five characters of the token.
  suffix3         Take the last three characters of the token.
  suffix2         Take the last two characters of the token.
  suffix1         Take the last character of the token.
  pos             Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).
  pos2            Take the first two characters of the Part-of-Speech tag of the token
                  (``SpacyTokenizer`` required).
  ==============  ==========================================================================================
  ```

  As the featurizer is moving over the tokens in a user message with a sliding window, you can define features for
  previous tokens, the current token, and the next tokens in the sliding window.
  You define the features as a [before, token, after] array.
  If you want to define features for the token before, the current token, and the token after,
  your features configuration would look like this:

  ```yaml-rasa
  pipeline:
  - name: LexicalSyntacticFeaturizer
    "features": [
      ["low", "title", "upper"],
      ["BOS", "EOS", "low", "upper", "title", "digit"],
      ["low", "title", "upper"],
    ]
  ```

  This configuration is also the default configuration.

  :::note
  If you want to make use of `pos` or `pos2` you need to add `SpacyTokenizer` to your pipeline.

  :::
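
The sliding-window behaviour described above can be sketched in a few lines of Python. This is an illustrative approximation, not the featurizer's actual implementation: the `FEATURES` mapping, the `window_features` helper, and the `offset:name` key format are hypothetical names chosen for this example, and only a subset of the feature table is covered.

```python
from typing import Callable, Dict, List

# Hypothetical feature functions mirroring a few entries of the feature table.
# Each takes (token, token_index, number_of_tokens).
FEATURES: Dict[str, Callable[[str, int, int], object]] = {
    "BOS": lambda tok, i, n: i == 0,
    "EOS": lambda tok, i, n: i == n - 1,
    "low": lambda tok, i, n: tok.islower(),
    "upper": lambda tok, i, n: tok.isupper(),
    "title": lambda tok, i, n: tok.istitle(),
    "digit": lambda tok, i, n: tok.isdigit(),
    "prefix2": lambda tok, i, n: tok[:2],
    "suffix3": lambda tok, i, n: tok[-3:],
}

# [before, token, after] layout, same values as the default configuration.
WINDOW = [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit"],
    ["low", "title", "upper"],
]


def window_features(tokens: List[str], window=WINDOW) -> List[Dict[str, object]]:
    """Return one feature dict per token, keyed as '<offset>:<feature>'."""
    half = len(window) // 2
    rows = []
    for i in range(len(tokens)):
        row: Dict[str, object] = {}
        for offset, names in zip(range(-half, half + 1), window):
            j = i + offset
            if not 0 <= j < len(tokens):
                continue  # no neighbour on this side of the sentence
            for name in names:
                row[f"{offset}:{name}"] = FEATURES[name](tokens[j], j, len(tokens))
        rows.append(row)
    return rows


features = window_features(["Book", "a", "table", "for", "2"])
```

For the first token, only keys with the offsets `0` and `1` are produced, because there is no previous token to look at.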

## Intent Classifiers

Intent classifiers assign one of the intents defined in the domain file to incoming user messages.

### MitieIntentClassifier


* **Short**

  MITIE intent classifier (using a
  [text categorizer](https://github.com/mit-nlp/MITIE/blob/master/examples/python/text_categorizer_pure_model.py))



* **Outputs**

  `intent`



* **Requires**

  `tokens` for user message and [MitieNLP](./components.mdx#mitienlp)



* **Output-Example**

  ```json
  {
      "intent": {"name": "greet", "confidence": 0.98343}
  }
  ```



* **Description**

  This classifier uses MITIE to perform intent classification. The underlying classifier
  uses a multi-class linear SVM with a sparse linear kernel (see the `train_text_categorizer_classifier` function in the
  [MITIE trainer code](https://github.com/mit-nlp/MITIE/blob/master/mitielib/src/text_categorizer_trainer.cpp)).

  :::note
  This classifier does not rely on any featurizer as it extracts features on its own.

  :::



* **Configuration**

  ```yaml-rasa
  pipeline:
  - name: "MitieIntentClassifier"
  ```


### LogisticRegressionClassifier

* **Short**

  Logistic regression intent classifier, using the [scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).



* **Outputs**

  `intent` and `intent_ranking`



* **Requires** 

  Either `sparse_features` or `dense_features` needs to be present.



* **Output-Example** 

  ```json
  {
      "intent": {"name": "greet", "confidence": 0.780},
      "intent_ranking": [
          {
              "confidence": 0.780,
              "name": "greet"
          },
          {
              "confidence": 0.140,
              "name": "goodbye"
          },
          {
              "confidence": 0.080,
              "name": "restaurant_search"
          }
      ]
  }
  ```


* **Description**

  This classifier uses scikit-learn's logistic regression implementation to perform intent classification.
  It can work with sparse features alone, but will also pick up any dense features that are present. In general,
  DIET should yield higher accuracy, but this classifier should train faster and may be used as
  a lightweight benchmark. Our implementation uses the default settings from scikit-learn, with the exception
  of the `class_weight` parameter, which is set to `"balanced"`.



* **Configuration**

  An example configuration with all the defaults can be found below.

  ```yaml
  pipeline:
  - name: LogisticRegressionClassifier
    max_iter: 100
    solver: lbfgs
    tol: 0.0001
    random_state: 42
    ranking_length: 10
  ```

  These configuration parameters are briefly explained below.

  - `max_iter`: Maximum number of iterations taken for the solvers to converge.
  - `solver`: Solver to be used. For very small datasets you might consider `liblinear`.
  - `tol`: Tolerance for stopping criteria of the optimizer.
  - `random_state`: Used to shuffle the data before training.
  - `ranking_length`: Number of top intents to report. Set to 0 to report all intents.

  More details on the parameters can be found on the [scikit-learn documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
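
To make the effect of `ranking_length` concrete, here is a minimal sketch of how a ranking could be truncated to the top intents. It is an illustration only: the `rank_intents` helper is a hypothetical name, and the input is assumed to be a plain mapping from intent name to confidence rather than the classifier's internal data structures.

```python
from typing import Dict


def rank_intents(confidences: Dict[str, float], ranking_length: int = 10) -> Dict[str, object]:
    """Sort intents by confidence and keep the top `ranking_length` entries.

    A `ranking_length` of 0 keeps every intent, as described above.
    """
    ranking = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    if ranking_length > 0:
        ranking = ranking[:ranking_length]
    intent_ranking = [{"name": name, "confidence": conf} for name, conf in ranking]
    # Assumption: with no intents at all, report an empty intent.
    top = intent_ranking[0] if intent_ranking else {"name": None, "confidence": 0.0}
    return {"intent": top, "intent_ranking": intent_ranking}


scores = {"greet": 0.78, "goodbye": 0.14, "restaurant_search": 0.08}
top_two = rank_intents(scores, ranking_length=2)
everything = rank_intents(scores, ranking_length=0)
```

Here `top_two["intent_ranking"]` holds only `greet` and `goodbye`, while `everything["intent_ranking"]` holds all three intents.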



### SklearnIntentClassifier


* **Short**

  Sklearn intent classifier



* **Outputs**

  `intent` and `intent_ranking`



* **Requires**

  `dense_features` for user messages



* **Output-Example**

  ```json
  {
      "intent": {"name": "greet", "confidence": 0.7800},
      "intent_ranking": [
          {
              "confidence": 0.7800,
              "name": "greet"
          },
          {
              "confidence": 0.1400,
              "name": "goodbye"
          },
          {
              "confidence": 0.0800,
              "name": "restaurant_search"
          }
      ]
  }
  ```



* **Description**

  The sklearn intent classifier trains a linear SVM which gets optimized using a grid search. It also provides
  rankings of the labels that did not “win”. The `SklearnIntentClassifier` needs to be preceded by a dense
  featurizer in the pipeline. This dense featurizer creates the features used for the classification.
  For more information about the algorithm itself, take a look at the
  [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
  documentation.



* **Configuration**

  During the training of the SVM a hyperparameter search is run to find the best parameter set.
  In the configuration you can specify the parameters that will get tried.

  ```yaml-rasa
  pipeline:
  - name: "SklearnIntentClassifier"
    # Specifies the list of regularization values to
    # cross-validate over for C-SVM.
    # This is used with the ``kernel`` hyperparameter in GridSearchCV.
    C: [1, 2, 5, 10, 20, 100]
    # Specifies the kernel to use with C-SVM.
    # This is used with the ``C`` hyperparameter in GridSearchCV.
    kernels: ["linear"]
    # Gamma parameter of the C-SVM.
    gamma: [0.1]
    # The number of cross-validation folds is chosen during intent
    # training; this specifies the maximum number of folds.
    max_cross_validation_folds: 5
    # Scoring function used for evaluating the hyperparameters.
    # This can be a name or a function.
    scoring_function: "f1_weighted"
  ```
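
As a rough illustration of what "the parameters that will get tried" means, the sketch below expands the lists from the configuration above into individual candidate settings, the way a grid search enumerates them. It does not call scikit-learn; the `candidate_parameter_sets` helper is a hypothetical name, and mapping each entry of `kernels` to a single `kernel` value per candidate is an assumption made for this example.

```python
from itertools import product
from typing import Dict, List

# Same values as the example configuration.
config = {
    "C": [1, 2, 5, 10, 20, 100],
    "kernels": ["linear"],
    "gamma": [0.1],
}


def candidate_parameter_sets(cfg: Dict[str, list]) -> List[Dict[str, object]]:
    """Return every (C, kernel, gamma) combination a grid search would evaluate."""
    return [
        {"C": c, "kernel": kernel, "gamma": gamma}
        for c, kernel, gamma in product(cfg["C"], cfg["kernels"], cfg["gamma"])
    ]


candidates = candidate_parameter_sets(config)
```

With the values above there are six candidates, one per value of `C`. Adding a second kernel or gamma value doubles the number of combinations, and each combination is trained once per cross-validation fold.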


### KeywordIntentClassifier


* **Short**

  Simple keyword matching intent classifier, intended for small, short-term projects.



* **Outputs**

  `intent`



* **Requires**

  Nothing



* **Output-Example**

  ```json
  {
      "intent": {"name": "greet", "confidence": 1.0}
  }
  ```



* **Description**

  This classifier works by searching a message for keywords.
  The matching is case sensitive by default and searches only for exact matches of the keyword-string in the user
  message.
  The keywords for an intent are the examples of that intent in the NLU training data.
  This means the entire example is the keyword, not the individual words in the example.

  :::note
  This classifier is intended only for small projects or to get started. If
  you have little NLU training data, you can take a look at the recommended pipelines in
  [Tuning Your Model](./tuning-your-model.mdx).

  :::



* **Configuration**

  ```yaml-rasa
  pipeline:
  - name: "KeywordIntentClassifier"
    case_sensitive: True
  ```
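
The matching rule described above, where the whole training example acts as the keyword, can be sketched as follows. This is a simplified illustration rather than the component's code: `match_intent` and `EXAMPLES` are hypothetical names, and the use of word boundaries (`\b`) around the keyword is an assumption of this sketch.

```python
import re
from typing import Dict, List, Optional

# Hypothetical training examples per intent; each full example is one keyword.
EXAMPLES: Dict[str, List[str]] = {
    "greet": ["hello", "good morning"],
    "goodbye": ["bye"],
}


def match_intent(
    message: str, examples: Dict[str, List[str]], case_sensitive: bool = True
) -> Optional[str]:
    """Return the first intent whose keyword occurs in the message, else None."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for intent, keywords in examples.items():
        for keyword in keywords:
            # The entire example is the keyword, so "good morning" does not
            # match a message that only contains "good".
            pattern = r"\b" + re.escape(keyword) + r"\b"
            if re.search(pattern, message, flags):
                return intent
    return None


matched = match_intent("hello there", EXAMPLES)
```

With `case_sensitive` left at its default, `"Hello there"` does not match the `hello` keyword; passing `case_sensitive=False` makes it match.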


### DIETClassifier


* **Short**

  Dual Intent and Entity Transformer (DIET) used for intent classification and entity extraction



* **Outputs**

  `entities`, `intent` and `intent_ranking`



* **Requires**

  `dense_features` and/or `sparse_features` for user message and optionally the intent



* **Output-Example**

  ```json
  {
      "intent": {"name": "greet", "confidence": 0.7800},
      "intent_ranking": [
          {
              "confidence": 0.7800,
              "name": "greet"
          },
          {
              "confidence": 0.1400,
              "name": "goodbye"
          },
          {
              "confidence": 0.0800,
              "name": "restaurant_search"
          }
      ],
      "entities": [{
          "end": 53,
          "entity": "time",
          "start": 48,
          "value": "2017-04-10T00:00:00.000+02:00",
          "confidence": 1.0,
          "extractor": "DIETClassifier"
      }]
  }
  ```



* **Description**

  DIET (Dual Intent and Entity Transformer) is a multi-task architecture for intent classification and entity
  recognition. The architecture is based on a transformer which is shared for both tasks.
  A sequence of entity labels is predicted through a Conditional Random Field (CRF) tagging layer on top of the
  transformer output sequence corresponding to the input sequence of tokens.
  For the intent labels the transformer output for the complete utterance and intent labels are embedded into a
  single semantic vector space. We use the dot-product loss to maximize the similarity with the target label and
  minimize similarities with negative samples.

  DIET does not provide pre-trained word embeddings or pre-trained language models, but it is able to use these features if
  they are added to the pipeline. If you want to learn more about the model, check out the
  [Algorithm Whiteboard](https://www.youtube.com/playlist?list=PL75e0qA87dlG-za8eLI6t0_Pbxafk-cxb) series on YouTube,
  where we explain the model architecture in detail.

  :::note
  If during prediction time a message contains **only** words unseen during training
  and no Out-Of-Vocabulary preprocessor was used, an empty intent `None` is predicted with confidence
  `0.0`. This might happen if you only use the [CountVectorsFeaturizer](./components.mdx#countvectorsfeaturizer) with a `word` analyzer
  as featurizer. If you use the `char_wb` analyzer, you should always get an intent with a confidence
  value `> 0.0`.

  :::

* **Configuration**

  If you want to use the `DIETClassifier` just for intent classification, set `entity_recognition` to `False`.
  If you want to do only entity recognition, set `intent_classification` to `False`.
  By default `DIETClassifier` does both, i.e. `entity_recognition` and `intent_classification` are set to
  `True`.

  You can define a number of hyperparameters to adapt the model.
  If you want to adapt your model, start by modifying the following parameters:

  * `epochs`:
    This parameter sets the number of times the algorithm will see the training data (default: `300`).
    One epoch equals one forward pass and one backward pass over all the training examples.
    Sometimes the model needs more epochs to properly learn.
    Sometimes more epochs don't influence the performance.
    The lower the number of epochs the faster the model is trained.

  * `hidden_layers_sizes`:
    This parameter allows you to define the number of feed forward layers and their output
    dimensions for user messages and intents (default: `text: [], label: []`).
    Every entry in the list corresponds to a feed forward layer.
    For example, if you set `text: [256, 128]`, we will add two feed forward layers in front of
    the transformer. The vectors of the input tokens (coming from the user message) will be passed on to those
    layers. The first layer will have an output dimension of 256 and the second layer will have an output
    dimension of 128. If an empty list is used (default behavior), no feed forward layer will be
    added.
    Make sure to use only positive integer values. Powers of two are commonly used.
    It is also common practice to use non-increasing values in the list, i.e. each value is smaller than or equal to the
    value before it.

  * `embedding_dimension`:
    This parameter defines the output dimension of the embedding layers used inside the model (default: `20`).
    We are using multiple embedding layers inside the model architecture.
    For example, the vector of the complete utterance and the intent is passed on to an embedding layer before
    they are compared and the loss is calculated.

  * `number_of_transformer_layers`:
    This parameter sets the number of transformer layers to use (default: `2`).
    The number of transformer layers corresponds to the transformer blocks to use for the model.

  * `transformer_size`:
    This parameter sets the number of units in the transformer (default: `256`).
    The vectors coming out of the transformers will have the given `transformer_size`.
    The `transformer_size` should be a multiple of the `number_of_attention_heads` parameter;
    otherwise, training exits with an error.

  * `connection_density`:
    This parameter defines the fraction of kernel weights that are set to non-zero values for all feed forward
    layers in the model (default: `0.2`). The value should be between 0 and 1. If you set `connection_density`
    to 1, no kernel weights will be set to 0 and the layer acts as a standard feed forward layer. You should not
    set `connection_density` to 0, as this would result in all kernel weights being 0, i.e. the model would not be able
    to learn.

  * `constrain_similarities`:
    When set to `True`, this parameter applies a sigmoid cross entropy loss over all similarity terms.
    This helps keep the similarities between the input and negative labels small,
    which should help the model generalize better to real-world test sets.

  * `model_confidence`:
    This parameter allows the user to configure how confidences are computed during inference. It accepts only
    one value, `softmax`[^1]. With `softmax`, confidences are in the range `[0, 1]`: the computed
    similarities are normalized with the `softmax` activation function.
    
[^1]: Note that `model_confidence: cosine` is deprecated as of 2.3.4 (see [changelog](./changelog.mdx#234---2021-02-26))
    and cannot be specified in the config. However, `model_confidence: cosine` will be used regardless of the config
    if `loss_type: margin` is specified.
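
As a minimal sketch of the `softmax` confidence computation, assume the utterance and each intent label have already been embedded into the same vector space. The dot-product similarities are then normalized so that they lie in `[0, 1]` and sum to 1. The function names and the toy two-dimensional embeddings below are hypothetical and only serve the illustration; the real model works on learned embeddings.

```python
import math
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax_confidences(
    utterance: List[float], label_embeddings: Dict[str, List[float]]
) -> Dict[str, float]:
    """Turn dot-product similarities into confidences that sum to 1."""
    sims = {name: dot(utterance, emb) for name, emb in label_embeddings.items()}
    largest = max(sims.values())
    # Subtracting the largest similarity keeps exp() numerically stable
    # without changing the result.
    exps = {name: math.exp(sim - largest) for name, sim in sims.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


confidences = softmax_confidences(
    [1.0, 0.0],
    {"greet": [2.0, 0.0], "goodbye": [0.0, 2.0]},
)
```

In this toy case the utterance is most similar to `greet`, so `greet` receives the highest confidence.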

The above configuration parameters are the ones you should configure to fit your model to your data.
However, additional parameters exist that can be adapted.

<details><summary>More configurable parameters</summary>

```
+---------------------------------+------------------+--------------------------------------------------------------+
| Parameter                       | Default Value    | Description                                                  |
+=================================+==================+==============================================================+
| hidden_layers_sizes             | text: []         | Hidden layer sizes for layers before the embedding layers    |
|                                 | label: []        | for user messages and labels. The number of hidden layers is |
|                                 |                  | equal to the length of the corresponding list.               |
+---------------------------------+------------------+--------------------------------------------------------------+
| share_hidden_layers             | False            | Whether to share the hidden layer weights between user       |
|                                 |                  | messages and labels.                                         |
+---------------------------------+------------------+--------------------------------------------------------------+
| transformer_size                | 256              | Number of units in transformer.                              |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_transformer_layers    | 2                | Number of transformer layers.                                |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_attention_heads       | 4                | Number of attention heads in transformer.                    |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_key_relative_attention      | False            | If 'True' use key relative embeddings in attention.          |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_value_relative_attention    | False            | If 'True' use value relative embeddings in attention.        |
+---------------------------------+------------------+--------------------------------------------------------------+
| max_relative_position           | None             | Maximum position for relative embeddings.                    |
+---------------------------------+------------------+--------------------------------------------------------------+
| unidirectional_encoder          | False            | Use a unidirectional or bidirectional encoder.               |
+---------------------------------+------------------+--------------------------------------------------------------+
| batch_size                      | [64, 256]        | Initial and final value for batch sizes.                     |
|                                 |                  | Batch size will be linearly increased for each epoch.        |
|                                 |                  | If constant `batch_size` is required, pass an int, e.g. `8`. |
+---------------------------------+------------------+--------------------------------------------------------------+
| batch_strategy                  | "balanced"       | Strategy used when creating batches.                         |
|                                 |                  | Can be either 'sequence' or 'balanced'.                      |
+---------------------------------+------------------+--------------------------------------------------------------+
| epochs                          | 300              | Number of epochs to train.                                   |
+---------------------------------+------------------+--------------------------------------------------------------+
| random_seed                     | None             | Set random seed to any 'int' to get reproducible results.    |
+---------------------------------+------------------+--------------------------------------------------------------+
| learning_rate                   | 0.001            | Initial learning rate for the optimizer.                     |
+---------------------------------+------------------+--------------------------------------------------------------+
| embedding_dimension             | 20               | Dimension size of embedding vectors.                         |
+---------------------------------+------------------+--------------------------------------------------------------+
| dense_dimension                 | text: 128        | Dense dimension for sparse features to use.                  |
|                                 | label: 20        |                                                              |
+---------------------------------+------------------+--------------------------------------------------------------+
| concat_dimension                | text: 128        | Concat dimension for sequence and sentence features.         |
|                                 | label: 20        |                                                              |
+---------------------------------+------------------+--------------------------------------------------------------+
| number_of_negative_examples     | 20               | The number of incorrect labels. The algorithm will minimize  |
|                                 |                  | their similarity to the user input during training.          |
+---------------------------------+------------------+--------------------------------------------------------------+
| similarity_type                 | "auto"           | Type of similarity measure to use, either 'auto' or 'cosine' |
|                                 |                  | or 'inner'.                                                  |
+---------------------------------+------------------+--------------------------------------------------------------+
| loss_type                       | "cross_entropy"  | The type of the loss function, either 'cross_entropy'        |
|                                 |                  | or 'margin'. If type 'margin' is specified,                  |
|                                 |                  | "model_confidence=cosine" will be used which is deprecated   |
|                                 |                  | as of 2.3.4. See footnote (1).                               |
+---------------------------------+------------------+--------------------------------------------------------------+
| ranking_length                  | 10               | Number of top intents to report. Set to 0 to report all      |
|                                 |                  | intents.                                                     |
+---------------------------------+------------------+--------------------------------------------------------------+
| renormalize_confidences         | False            | Normalize the reported top intents. Applicable only with loss|
|                                 |                  | type 'cross_entropy' and 'softmax' confidences.              |
+---------------------------------+------------------+--------------------------------------------------------------+
| maximum_positive_similarity     | 0.8              | Indicates how similar the algorithm should try to make       |
|                                 |                  | embedding vectors for correct labels.                        |
|                                 |                  | Should be 0.0 < ... < 1.0 for 'cosine' similarity type.      |
+---------------------------------+------------------+--------------------------------------------------------------+
| maximum_negative_similarity     | -0.4             | Maximum negative similarity for incorrect labels.            |
|                                 |                  | Should be -1.0 < ... < 1.0 for 'cosine' similarity type.     |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_maximum_negative_similarity | True             | If 'True' the algorithm only minimizes maximum similarity    |
|                                 |                  | over incorrect intent labels, used only if 'loss_type' is    |
|                                 |                  | set to 'margin'.                                             |
+---------------------------------+------------------+--------------------------------------------------------------+
| scale_loss                      | False            | Scale loss inverse proportionally to confidence of correct   |
|                                 |                  | prediction.                                                  |
+---------------------------------+------------------+--------------------------------------------------------------+
| regularization_constant         | 0.002            | The scale of regularization.                                 |
+---------------------------------+------------------+--------------------------------------------------------------+
| negative_margin_scale           | 0.8              | The scale of how important it is to minimize the maximum     |
|                                 |                  | similarity between embeddings of different labels.           |
+---------------------------------+------------------+--------------------------------------------------------------+
| connection_density              | 0.2              | Connection density of the weights in dense layers.           |
|                                 |                  | Value should be between 0 and 1.                             |
+---------------------------------+------------------+--------------------------------------------------------------+
| drop_rate                       | 0.2              | Dropout rate for encoder. Value should be between 0 and 1.   |
|                                 |                  | The higher the value the higher the regularization effect.   |
+---------------------------------+------------------+--------------------------------------------------------------+
| drop_rate_attention             | 0.0              | Dropout rate for attention. Value should be between 0 and 1. |
|                                 |                  | The higher the value the higher the regularization effect.   |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_sparse_input_dropout        | True             | If 'True' apply dropout to sparse input tensors.             |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_dense_input_dropout         | True             | If 'True' apply dropout to dense input tensors.              |
+---------------------------------+------------------+--------------------------------------------------------------+
| evaluate_every_number_of_epochs | 20               | How often to calculate validation accuracy.                  |
|                                 |                  | Set to '-1' to evaluate just once at the end of training.    |
+---------------------------------+------------------+--------------------------------------------------------------+
| evaluate_on_number_of_examples  | 0                | How many examples to use for hold out validation set.        |
|                                 |                  | Large values may hurt performance, e.g. model accuracy.      |
+---------------------------------+------------------+--------------------------------------------------------------+
| intent_classification           | True             | If 'True' intent classification is trained and intents are   |
|                                 |                  | predicted.                                                   |
+---------------------------------+------------------+--------------------------------------------------------------+
| entity_recognition              | True             | If 'True' entity recognition is trained and entities are     |
|                                 |                  | extracted.                                                   |
+---------------------------------+------------------+--------------------------------------------------------------+
| use_masked_language_model       | False            | If 'True' random tokens of the input message will be masked  |
|                                 |                  | and the model has to predict those tokens. It acts like a    |
|                                 |                  | regularizer and should help to learn a better contextual     |
|                                 |                  | representation of the input.                                 |
+---------------------------------+------------------+--------------------------------------------------------------+
| tensorboard_log_directory       | None             | If you want to use tensorboard to visualize training         |
|                                 |                  | metrics, set this option to a valid output directory. You    |
|                                 |                  | can view the training metrics after training in tensorboard  |
|                                 |                  | via 'tensorboard --logdir <path-to-given-directory>'.        |
+---------------------------------+------------------+--------------------------------------------------------------+
| tensorboard_log_level           | "epoch"          | Define when training metrics for tensorboard should be       |
|                                 |                  | logged. Either after every epoch ('epoch') or for every      |
|                                 |                  | training step ('batch').                                     |
+---------------------------------+------------------+--------------------------------------------------------------+
| featurizers                     | []               | List of featurizer names (alias names). Only features        |
|                                 |                  | coming from the listed names are used. If list is empty      |
|                                 |                  | all available features are used.                             |
+---------------------------------+------------------+--------------------------------------------------------------+
| checkpoint_model                | False            | Save the best performing model during training. Models are   |
|                                 |                  | stored to the location specified by `--out`. Only the one    |
|                                 |                  | best model will be saved.                                    |
|                                 |                  | Requires `evaluate_on_number_of_examples > 0` and            |
|                                 |                  | `evaluate_every_number_of_epochs > 0`                        |
+---------------------------------+------------------+--------------------------------------------------------------+
| split_entities_by_comma         | True             | Splits a list of extracted entities by comma to treat each   |
|                                 |                  | one of them as a single entity. Can either be `True`/`False` |
|                                 |                  | globally, or set per entity type, such as:                   |
|                                 |                  | ```                                                          |
|                                 |                  | ...                                                          |
|                                 |                  | - name: DIETClassifier                                       |
|                                 |                  |   split_entities_by_comma:                                   |
|                                 |                  |     address: True                                            |
|                                 |                  |     ...                                                      |
|                                 |                  | ...                                                          |
|                                 |                  | ```                                                          |
+---------------------------------+------------------+--------------------------------------------------------------+
| constrain_similarities          | False            | If `True`, applies sigmoid on all similarity terms and adds  |
|                                 |                  | it to the loss function to ensure that similarity values are |
|                                 |                  | approximately bounded. Used only if `loss_type=cross_entropy`|
+---------------------------------+------------------+--------------------------------------------------------------+
| model_confidence                | "softmax"        | Affects how model's confidence for each intent               |
|                                 |                  | is computed. Currently, only one value is supported:         |
|                                 |                  | 1. `softmax` - Similarities between input and intent         |
|                                 |                  | embeddings are post-processed with a softmax function,       |
|                                 |                  | as a result of which confidence for all intents sum up to 1. |
|                                 |                  | This parameter does not affect the confidence for entity     |
|                                 |                  | prediction.                                                  |
+---------------------------------+------------------+--------------------------------------------------------------+
```

:::note
Parameter `maximum_negative_similarity` is set to a negative value to mimic the original
StarSpace algorithm in the case `maximum_negative_similarity = maximum_positive_similarity`
and `use_maximum_negative_similarity = False`.
See the [StarSpace paper](https://arxiv.org/abs/1709.03856) for details.

:::

</details>

### FallbackClassifier

* **Short**

  Classifies a message with the intent `nlu_fallback` if the NLU intent classification
  scores are ambiguous. The confidence is set to be the same as the `fallback threshold`.

* **Outputs**

  `entities`, `intent` and `intent_ranking`

* **Requires**

  `intent` and `intent_ranking` output from a previous intent classifier

* **Output-Example**

    ```json
    {
        "intent": {"name": "nlu_fallback", "confidence": 0.7183846840434321},
        "intent_ranking": [
            {
                "confidence": 0.7183846840434321,
                "name": "nlu_fallback"
            },
            {
                "confidence": 0.28161531595656784,
                "name": "restaurant_search"
            }
        ],
        "entities": [{
            "end": 53,
            "entity": "time",
            "start": 48,
            "value": "2017-04-10T00:00:00.000+02:00",
            "confidence": 1.0,
            "extractor": "DIETClassifier"
        }]
    }
    ```

* **Description**

    The `FallbackClassifier` classifies a user message with the intent `nlu_fallback`
    if the previous intent classifier was not able to classify an intent with a
    confidence greater than or equal to the `threshold` of the `FallbackClassifier`.
    It can also predict the fallback intent when the confidence scores of the two
    top-ranked intents are closer than the `ambiguity_threshold`.

    You can use the `FallbackClassifier` to implement a
    [Fallback Action](./fallback-handoff.mdx#fallbacks) which handles messages with uncertain
    NLU predictions.

    ```yaml-rasa
    rules:

    - rule: Ask the user to rephrase in case of low NLU confidence
      steps:
      - intent: nlu_fallback
      - action: utter_please_rephrase
    ```

* **Configuration**

    The `FallbackClassifier` will only add its prediction for the `nlu_fallback`
    intent if no other intent was predicted with a confidence greater than or equal
    to `threshold`.

    - `threshold`:
      This parameter sets the threshold for predicting the `nlu_fallback` intent.
      If no intent predicted by a previous intent classifier has a confidence
      greater than or equal to `threshold`, the `FallbackClassifier` will add
      a prediction of the `nlu_fallback` intent with a confidence equal to `threshold`.
    - `ambiguity_threshold`: If you configure an `ambiguity_threshold`, the
      `FallbackClassifier` will also predict the `nlu_fallback` intent if
      the difference between the confidence scores of the two highest-ranked intents is
      smaller than the `ambiguity_threshold`.
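
The two rules described above can be sketched in a few lines of Python. This is a simplified illustration, not the component's actual implementation: the function name `should_fall_back` is hypothetical, and treating an empty ranking as a fallback case is an assumption made for the example.

```python
from typing import Dict, List, Optional


def should_fall_back(
    intent_ranking: List[Dict],
    threshold: float,
    ambiguity_threshold: Optional[float] = None,
) -> bool:
    """Return True if the ranking is too uncertain or too ambiguous."""
    if not intent_ranking:
        # Assumption for this sketch: nothing was predicted, so fall back.
        return True
    ranked = sorted(intent_ranking, key=lambda i: i["confidence"], reverse=True)
    # Rule 1: no intent reaches the confidence threshold.
    if ranked[0]["confidence"] < threshold:
        return True
    # Rule 2: the two top intents are closer than the ambiguity threshold.
    if ambiguity_threshold is not None and len(ranked) >= 2:
        gap = ranked[0]["confidence"] - ranked[1]["confidence"]
        return gap < ambiguity_threshold
    return False


ranking = [
    {"name": "greet", "confidence": 0.62},
    {"name": "goodbye", "confidence": 0.58},
]
print(should_fall_back(ranking, threshold=0.7))
print(should_fall_back(ranking, threshold=0.5, ambiguity_threshold=0.1))
```

Both calls report a fallback: the first because no intent reaches `0.7`, the second because the two top intents are only about `0.04` apart.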


## Entity Extractors

Entity extractors extract entities, such as person names or locations, from the user message.

:::note
  If you use multiple entity extractors, we advise that each extractor targets an exclusive
  set of entity types. For example, use [Duckling](components.mdx#ducklingentityextractor) to extract dates and times, and
  [DIETClassifier](components.mdx#dietclassifier-1) to extract person names. Otherwise, if multiple extractors
  target the same entity types, it is very likely that entities will be extracted multiple times.

  For example, if you use two or more general purpose extractors like [MitieEntityExtractor](components.mdx#mitieentityextractor),
  [DIETClassifier](components.mdx#dietclassifier-1), or [CRFEntityExtractor](components.mdx#crfentityextractor),
  the entity types in your training data will be found and
  extracted by all of them. If the [slots](domain.mdx#slots) you are filling with your entity types are of type `text`,
  then the last extractor in your pipeline will win. If the slot is of type `list`, then all results
  will be added to the list, including duplicates.

  Another, less obvious case of duplicate/overlapping extraction can happen even if extractors focus on different
  entity types. Imagine a food delivery bot and a user message like `I would like to order the Monday special`.
  Hypothetically, if your time extractor's performance isn't very good, it might extract `Monday` here as a time for the order,
  and your other extractor might extract `Monday special` as the meal.
  If you struggle with overlapping entities of this sort, it might make sense to add additional training data
  to improve your extractor. If that does not suffice, you can add a
  [custom component](components.mdx#custom-components) that resolves conflicts in entity
  extraction according to your own logic.
:::
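
As an illustration of the kind of conflict-resolution logic mentioned in the note, the sketch below keeps the longest span whenever two extracted entities overlap. The rule itself and the function name `drop_overlapping_entities` are assumptions chosen for the example. In a real pipeline this logic would live inside a custom component, and a different rule (for example, preferring a specific extractor) may suit your bot better.

```python
from typing import Any, Dict, List


def drop_overlapping_entities(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the longest span among overlapping entities (ties: earliest start)."""
    # Visit the longest spans first so that they claim their characters first.
    by_length = sorted(entities, key=lambda e: (-(e["end"] - e["start"]), e["start"]))
    kept: List[Dict[str, Any]] = []
    for candidate in by_length:
        overlaps = any(
            candidate["start"] < k["end"] and k["start"] < candidate["end"]
            for k in kept
        )
        if not overlaps:
            kept.append(candidate)
    # Return the survivors in reading order.
    return sorted(kept, key=lambda e: e["start"])


text = "I would like to order the Monday special"
entities = [
    {"entity": "time", "value": "Monday", "start": 26, "end": 32},
    {"entity": "meal", "value": "Monday special", "start": 26, "end": 40},
]
print(drop_overlapping_entities(entities))
```

With the food delivery message from the note, only the `meal` entity `Monday special` survives, because it covers the shorter `time` span.
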

### MitieEntityExtractor


* **Short**

  MITIE entity extraction (using a [MITIE NER trainer](https://github.com/mit-nlp/MITIE/blob/master/mitielib/src/ner_trainer.cpp))



* **Outputs**

  `entities`



* **Requires**

  [MitieNLP](./components.mdx#mitienlp) and `tokens`



* **Output-Example**

  ```json
  {
      "entities": [{
          "value": "New York City",
          "start": 20,
          "end": 33,
          "confidence": null,
          "entity": "city",
          "extractor": "MitieEntityExtractor"
      }]
  }
  ```



* **Description**

  `MitieEntityExtractor` uses the MITIE entity extraction to find entities in a message. The underlying classifier
  uses a multi-class linear SVM with a sparse linear kernel and custom features.
  The MITIE component does not provide entity confidence values.

  :::note
  This entity extractor does not rely on any featurizer as it extracts features on its own.

  :::



* **Configuration**

  ```yaml-rasa
  pipeline:
  - name: "MitieEntityExtractor"
  ```


### SpacyEntityExtractor


* **Short**

  spaCy entity extraction



* **Outputs**

  `entities`



* **Requires**

  [SpacyNLP](./components.mdx#spacynlp)



* **Output-Example**

  ```json
  {
      "entities": [{
          "value": "New York City",
          "start": 20,
          "end": 33,
          "confidence": null,
          "entity": "city",
          "extractor": "SpacyEntityExtractor"
      }]
  }
  ```



* **Description**

  This component uses spaCy to predict the entities of a message. spaCy uses a statistical BILOU transition model.
  As of now, this component can only use the spaCy built-in entity extraction models and cannot be retrained.

  You can test out spaCy's entity extraction models in this [interactive demo](https://explosion.ai/demos/displacy-ent).
  Note that some spaCy models are highly case-sensitive.

  :::note
  The `SpacyEntityExtractor` extractor does not provide a `confidence` level and will always return `null`.

  :::

* **Configuration**

  Configure which dimensions, i.e. entity types, the spaCy component
  should extract. A full list of available dimensions can be found in
  the [spaCy documentation](https://spacy.io/api/annotation#section-named-entities).
  Leaving the dimensions option unspecified will extract all available dimensions.

  ```yaml-rasa
  pipeline:
  - name: "SpacyEntityExtractor"
    # dimensions to extract
    dimensions: ["PERSON", "LOC", "ORG", "PRODUCT"]
  ```


### CRFEntityExtractor


* **Short**

  Conditional random field (CRF) entity extraction



* **Outputs**

  `entities`



* **Requires**

  `tokens` and `dense_features` (optional)



* **Output-Example**

  ```json
  {
      "entities": [{
          "value": "New York City",
          "start": 20,
          "end": 33,
          "entity": "city",
          "confidence": 0.874,
          "extractor": "CRFEntityExtractor"
      }]
  }
  ```



* **Description**

  This component implements a conditional random field (CRF) to do named entity recognition.
  CRFs can be thought of as an undirected Markov chain where the time steps are words
  and the states are entity classes. Features of the words (capitalization, POS tagging,
  etc.) give probabilities to certain entity classes, as do transitions between
  neighbouring entity tags: the most likely set of tags is then calculated and returned.


  If you want to pass custom features, such as pre-trained word embeddings, to `CRFEntityExtractor`, you can
  add any dense featurizer to the pipeline before the `CRFEntityExtractor` and subsequently configure
  `CRFEntityExtractor` to make use of the dense features by adding `"text_dense_features"` to its feature configuration.
  `CRFEntityExtractor` automatically finds the additional dense features and checks if the dense features are an
  iterable of `len(tokens)`, where each entry is a vector.
  If the check fails, a warning is shown and `CRFEntityExtractor` continues to train
  without the additional custom features.
  If dense features are present, `CRFEntityExtractor` passes them to `sklearn_crfsuite`
  and uses them for training.



* **Configuration**

  `CRFEntityExtractor` has a list of default features to use.
  However, you can overwrite the default configuration.
  The following features are available:

  ```
  ===================  ==========================================================================================
  Feature Name         Description
  ===================  ==========================================================================================
  low                  word identity - use the lower-cased token as a feature.
  upper                Checks if the token is upper case.
  title                Checks if the token starts with an uppercase character and all remaining characters are
                       lowercased.
  digit                Checks if the token contains just digits.
  prefix5              Take the first five characters of the token.
  prefix2              Take the first two characters of the token.
  suffix5              Take the last five characters of the token.
  suffix3              Take the last three characters of the token.
  suffix2              Take the last two characters of the token.
  suffix1              Take the last character of the token.
  pos                  Take the Part-of-Speech tag of the token (``SpacyTokenizer`` required).
  pos2                 Take the first two characters of the Part-of-Speech tag of the token
                       (``SpacyTokenizer`` required).
  pattern              Take the patterns defined by ``RegexFeaturizer``.
  bias                 Add an additional "bias" feature to the list of features.
  text_dense_features  Adds additional features from a dense featurizer.
  ===================  ==========================================================================================
  ```

  As the featurizer moves over the tokens in a user message with a sliding window, you can define features for
  previous tokens, the current token, and the next tokens in the sliding window.
  You define the features as a `[before, token, after]` array.

  Additionally, you can set a flag to determine whether to use the BILOU tagging schema or not.

  * `BILOU_flag` determines whether to use BILOU tagging or not. Default `True`.

  ```yaml-rasa
  pipeline:
  - name: "CRFEntityExtractor"
    # BILOU_flag determines whether to use BILOU tagging or not.
    "BILOU_flag": True
    # features to extract in the sliding window
    "features": [
      ["low", "title", "upper"],
      [
        "bias",
        "low",
        "prefix5",
        "prefix2",
        "suffix5",
        "suffix3",
        "suffix2",
        "upper",
        "title",
        "digit",
        "pattern",
        "text_dense_features"
      ],
      ["low", "title", "upper"],
    ]
    # The maximum number of iterations for optimization algorithms.
    "max_iterations": 50
    # weight of the L1 regularization
    "L1_c": 0.1
    # weight of the L2 regularization
    "L2_c": 0.1
    # Name of dense featurizers to use.
    # If list is empty all available dense features are used.
    "featurizers": []
    # Indicates whether a list of extracted entities should be split into individual entities for a given entity type
    "split_entities_by_comma":
        address: False
        email: True
  ```

  :::note
  If POS features are used (`pos` or `pos2`), you need to have `SpacyTokenizer` in your pipeline.

  :::

  :::note
  If `pattern` features are used, you need to have `RegexFeaturizer` in your pipeline.

  :::

  :::note
  If `text_dense_features` features are used, you need to have a dense featurizer (e.g. `LanguageModelFeaturizer`) in
  your pipeline.

  :::
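
To make the `[before, token, after]` sliding window more concrete, here is a minimal sketch of how such a configuration could be turned into one feature dictionary per token. It covers only a handful of the features from the table, and the key format (`-1:low`, `0:low`, `1:low`) and the function name `window_features` are assumptions made for illustration rather than the component's internals.

```python
from typing import Callable, Dict, List

# A few of the feature functions from the table above (illustrative subset).
FEATURE_FUNCTIONS: Dict[str, Callable[[str], object]] = {
    "low": lambda t: t.lower(),
    "upper": lambda t: t.isupper(),
    "title": lambda t: t.istitle(),
    "digit": lambda t: t.isdigit(),
    "prefix2": lambda t: t[:2],
    "suffix3": lambda t: t[-3:],
    "bias": lambda t: "bias",
}


def window_features(tokens: List[str], config: List[List[str]]) -> List[Dict[str, object]]:
    """Build one feature dict per token from a [before, token, after] config."""
    half = len(config) // 2  # 1 for a three-element window
    rows = []
    for i in range(len(tokens)):
        row: Dict[str, object] = {}
        for slot, names in enumerate(config):
            offset = slot - half  # -1 (before), 0 (token), 1 (after)
            j = i + offset
            if not 0 <= j < len(tokens):
                continue  # the window reaches past the message boundary
            for name in names:
                row[f"{offset}:{name}"] = FEATURE_FUNCTIONS[name](tokens[j])
        rows.append(row)
    return rows


config = [["low", "title"], ["bias", "low", "upper"], ["low"]]
for row in window_features(["Fly", "to", "NYC"], config):
    print(row)
```

Each row combines features of the previous token, the current token, and the next token, which is what lets the CRF use context such as "the previous word was capitalized".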

### DucklingEntityExtractor


* **Short**

  Duckling lets you extract common entities like dates,
  amounts of money, distances, and others in a number of languages.



* **Outputs**

  `entities`



* **Requires**

  Nothing



* **Output-Example**

  ```json
  {
      "entities": [{
          "end": 53,
          "entity": "time",
          "start": 48,
          "value": "2017-04-10T00:00:00.000+02:00",
          "confidence": 1.0,
          "extractor": "DucklingEntityExtractor"
      }]
  }
  ```



* **Description**

  To use this component you need to run a duckling server. The easiest
  option is to spin up a docker container using
  `docker run -p 8000:8000 rasa/duckling`.

  Alternatively, you can [install duckling directly on your
  machine](https://github.com/facebook/duckling#quickstart) and start the server.

  Duckling recognizes dates, numbers, distances and other structured entities
  and normalizes them.
  Please be aware that duckling tries to extract as many entity types as possible without
  providing a ranking. For example, if you specify both `number` and `time` as dimensions
  for the duckling component, the component will extract two entities: `10` as a number and
  `in 10 minutes` as a time from the text `I will be there in 10 minutes`. In such a
  situation, your application has to decide which entity type is the correct one.
  The extractor will always return 1.0 as a confidence, as it is a rule-based
  system.

  The list of supported languages can be found in the
  [Duckling GitHub repository](https://github.com/facebook/duckling/tree/master/Duckling/Dimensions).



* **Configuration**

  Configure which dimensions, i.e. entity types, the duckling component
  should extract. A full list of available dimensions can be found in
  the [duckling project readme](https://github.com/facebook/duckling).
  Leaving the dimensions option unspecified will extract all available dimensions.

  ```yaml-rasa
  pipeline:
  - name: "DucklingEntityExtractor"
    # url of the running duckling server
    url: "http://localhost:8000"
    # dimensions to extract
    dimensions: ["time", "number", "amount-of-money", "distance"]
    # allows you to configure the locale, by default the language is
    # used
    locale: "de_DE"
    # if not set the default timezone of Duckling is going to be used
    # needed to calculate dates from relative expressions like "tomorrow"
    timezone: "Europe/Berlin"
    # Timeout (in seconds) for receiving a response from the running duckling server.
    # If not set, the default timeout of 3 seconds is used.
    timeout: 3
  ```
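
To show how the options above relate to the running server, the following sketch builds a request for the `/parse` endpoint of Duckling's example HTTP server using only the Python standard library. The form fields `text`, `locale`, `dims` and `tz` are those accepted by that example server. The helper name `build_duckling_request` is hypothetical, and the mapping from the configuration keys to these fields is an assumption made for the example, not a description of the component's internals.

```python
import json
from typing import Dict, List, Optional
from urllib import parse, request


def build_duckling_request(
    text: str,
    url: str = "http://localhost:8000",
    locale: str = "de_DE",
    dimensions: Optional[List[str]] = None,
    timezone: Optional[str] = None,
) -> request.Request:
    """Prepare (but do not send) a POST request for a duckling server."""
    fields: Dict[str, str] = {"text": text, "locale": locale}
    if dimensions is not None:
        # The dimensions are sent as a JSON-encoded list.
        fields["dims"] = json.dumps(dimensions)
    if timezone is not None:
        fields["tz"] = timezone
    body = parse.urlencode(fields).encode("utf-8")
    return request.Request(f"{url}/parse", data=body, method="POST")


req = build_duckling_request(
    "I will be there in 10 minutes",
    dimensions=["time", "number"],
    timezone="Europe/Berlin",
)
print(req.full_url, req.data)
# With a server running, the request could be sent like this:
# with request.urlopen(req, timeout=3) as response:
#     print(json.load(response))
```

Sending the request is left commented out because it needs a running duckling server.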


### DIETClassifier


* **Short**

  Dual Intent and Entity Transformer (DIET) used for intent classification and entity extraction



* **Description**

  You can find the detailed description of the [DIETClassifier](./components.mdx#dietclassifier) under the section
  Intent Classifiers.


### RegexEntityExtractor

* **Short**

  Extracts entities using the lookup tables and/or regexes defined in the training data


* **Outputs**

  `entities`


* **Requires**

  Nothing


* **Description**

  This component extracts entities using the [lookup tables](nlu-training-data.mdx#lookup-tables) and [regexes](nlu-training-data.mdx#regular-expressions-for-entity-extraction) defined in the training data.
  The component checks if the user message contains an entry of one of the lookup tables or matches one of the
  regexes. If a match is found, the value is extracted as an entity.

  This component only uses those regex features that have a name equal to one of the entities defined in the
  training data. Make sure to annotate at least one example per entity.

  :::note
  Using this extractor in combination with [MitieEntityExtractor](components.mdx#mitieentityextractor),
  [CRFEntityExtractor](components.mdx#crfentityextractor), or [DIETClassifier](components.mdx#dietclassifier-1) can
  lead to multiple extraction of entities, especially if many training sentences have entity annotations for
  the entity types for which you also have defined regexes. See the big info box at the start of the
  [entity extractor section](components.mdx#entity-extractors) for more info on multiple extraction.

  If you seem to need both this RegexEntityExtractor and another of the aforementioned
  statistical extractors, we advise you to consider one of the following two options.

  Option 1 is advisable when you have exclusive entity types for each type of extractor. To make
  sure the extractors don't interfere with one another, annotate exactly one example sentence for each
  regex/lookup entity type.

  Option 2 is useful when you want to use regex matches as an additional signal for your statistical extractor,
  but you don't have separate entity types. In this case you will want to 1) add the
  [RegexFeaturizer](components.mdx#regexfeaturizer) before the extractors in your pipeline, 2)
  annotate all your entity examples in the training data, and 3) remove the RegexEntityExtractor from your pipeline.
  This way, your statistical extractors will receive an additional signal about the presence of regex matches
  and will be able to statistically determine when to rely on these matches and when not to.
  :::


* **Configuration**

  Make the entity extractor case sensitive by adding the `case_sensitive: True` option, the default being
  `case_sensitive: False`.

  To correctly process languages such as Chinese that don't use whitespace for word separation,
  the user needs to add the `use_word_boundaries: False` option, the default being `use_word_boundaries: True`.

  ```yaml-rasa
  pipeline:
  - name: RegexEntityExtractor
    # text is matched case insensitively by default
    case_sensitive: False
    # use lookup tables to extract entities
    use_lookup_tables: True
    # use regexes to extract entities
    use_regexes: True
    # match lookup table entries only at word boundaries
    use_word_boundaries: True
  ```
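
The effect of `case_sensitive` and `use_word_boundaries` can be illustrated with a small lookup-table matcher built on Python's `re` module. This is a simplified sketch under the assumption that each lookup entry is matched literally. The function name `extract_lookup_entities` is hypothetical and the code is not the component's implementation.

```python
import re
from typing import Any, Dict, List


def extract_lookup_entities(
    text: str,
    lookup_tables: Dict[str, List[str]],
    case_sensitive: bool = False,
    use_word_boundaries: bool = True,
) -> List[Dict[str, Any]]:
    """Find literal lookup-table entries in `text` and return them as entities."""
    flags = 0 if case_sensitive else re.IGNORECASE
    entities = []
    for entity_type, entries in lookup_tables.items():
        for entry in entries:
            pattern = re.escape(entry)
            if use_word_boundaries:
                # Only match when the entry is not embedded in a longer word.
                pattern = rf"\b{pattern}\b"
            for match in re.finditer(pattern, text, flags=flags):
                entities.append(
                    {
                        "entity": entity_type,
                        "value": match.group(0),
                        "start": match.start(),
                        "end": match.end(),
                    }
                )
    return sorted(entities, key=lambda e: e["start"])


cities = {"city": ["Berlin", "北京"]}
print(extract_lookup_entities("I want to fly to berlin", cities))
print(extract_lookup_entities("我想去北京", cities))
print(extract_lookup_entities("我想去北京", cities, use_word_boundaries=False))
```

The second call finds nothing: without whitespace there is no word boundary in front of `北京`, which is why `use_word_boundaries: False` is needed for such languages.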


### EntitySynonymMapper


* **Short**

  Maps synonymous entity values to the same value.



* **Outputs**

  Modifies existing entities that previous entity extraction components found.



* **Requires**

  An extractor from [Entity Extractors](./components.mdx)



* **Description**

  If the training data contains defined synonyms, this component will make sure that detected entity values will
  be mapped to the same value. For example, if your training data contains the following examples:

  ```json
  [
      {
        "text": "I moved to New York City",
        "intent": "inform_relocation",
        "entities": [{
          "value": "nyc",
          "start": 11,
          "end": 24,
          "entity": "city"
        }]
      },
      {
        "text": "I got a new flat in NYC.",
        "intent": "inform_relocation",
        "entities": [{
          "value": "nyc",
          "start": 20,
          "end": 23,
          "entity": "city"
        }]
      }
  ]
  ```

  This component will allow you to map the entities `New York City` and `NYC` to `nyc`. The entity
  extraction will return `nyc` even though the message contains `NYC`. When this component changes an
  existing entity, it appends itself to the processor list of this entity.



* **Configuration**

  ```yaml-rasa
  pipeline:
  - name: "EntitySynonymMapper"
  ```

  :::note
  When using the `EntitySynonymMapper` as part of an NLU pipeline, it will need to be placed
  below any entity extractors in the configuration file.

  :::
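
The mapping described above can be sketched as two small functions: one that collects synonyms from annotated training examples, and one that rewrites extracted entity values. The function names, the lower-casing of lookup keys, and the `processors` key used to record the change are assumptions made for this illustration.

```python
from typing import Any, Dict, List


def build_synonyms(examples: List[Dict[str, Any]]) -> Dict[str, str]:
    """Collect surface text -> canonical value from annotated examples."""
    synonyms: Dict[str, str] = {}
    for example in examples:
        for entity in example.get("entities", []):
            surface = example["text"][entity["start"]:entity["end"]]
            if surface != entity["value"]:
                synonyms[surface.lower()] = entity["value"]
    return synonyms


def apply_synonyms(
    entities: List[Dict[str, Any]], synonyms: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Replace entity values in place and record that a mapper changed them."""
    for entity in entities:
        key = str(entity["value"]).lower()
        if key in synonyms:
            entity["value"] = synonyms[key]
            entity.setdefault("processors", []).append("EntitySynonymMapper")
    return entities


examples = [
    {"text": "I moved to New York City",
     "entities": [{"value": "nyc", "start": 11, "end": 24, "entity": "city"}]},
    {"text": "I got a new flat in NYC.",
     "entities": [{"value": "nyc", "start": 20, "end": 23, "entity": "city"}]},
]
synonyms = build_synonyms(examples)
extracted = [{"entity": "city", "value": "New York City", "start": 11, "end": 24}]
print(apply_synonyms(extracted, synonyms))
```

Because the mapper only rewrites values that an extractor has already produced, it has to run after the extractors, which matches the placement advice in the note.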


## Combined Intent Classifiers and Entity Extractors

### DIETClassifier


  * **Short**

    Dual Intent and Entity Transformer (DIET) used for intent classification and entity extraction



  * **Outputs**

    `entities`, `intent` and `intent_ranking`



  * **Requires**

    `dense_features` and/or `sparse_features` for user message and optionally the intent



  * **Output-Example**

    ```json
    {
        "intent": {"name": "greet", "confidence": 0.7800},
        "intent_ranking": [
            {
                "confidence": 0.7800,
                "name": "greet"
            },
            {
                "confidence": 0.1400,
                "name": "goodbye"
            },
            {
                "confidence": 0.0800,
                "name": "restaurant_search"
            }
        ],
        "entities": [{
            "end": 53,
            "entity": "time",
            "start": 48,
            "value": "2017-04-10T00:00:00.000+02:00",
            "confidence": 1.0,
            "extractor": "DIETClassifier"
        }]
    }
    ```



  * **Description**

    DIET (Dual Intent and Entity Transformer) is a multi-task architecture for intent classification and entity
    recognition. The architecture is based on a transformer which is shared for both tasks.
    A sequence of entity labels is predicted through a Conditional Random Field (CRF) tagging layer on top of the
    transformer output sequence corresponding to the input sequence of tokens.
    For the intent labels the transformer output for the complete utterance and intent labels are embedded into a
    single semantic vector space. We use the dot-product loss to maximize the similarity with the target label and
    minimize similarities with negative samples.

    If you want to learn more about the model, check out the
    [Algorithm Whiteboard](https://www.youtube.com/playlist?list=PL75e0qA87dlG-za8eLI6t0_Pbxafk-cxb) series on YouTube,
    where we explain the model architecture in detail.

    :::note
    If during prediction time a message contains **only** words unseen during training
    and no Out-Of-Vocabulary preprocessor was used, an empty intent `None` is predicted with confidence
    `0.0`. This might happen if you only use the [CountVectorsFeaturizer](./components.mdx#countvectorsfeaturizer) with a `word` analyzer
    as featurizer. If you use the `char_wb` analyzer, you should always get an intent with a confidence
    value `> 0.0`.

    :::



  * **Configuration**

    If you want to use the `DIETClassifier` just for intent classification, set `entity_recognition` to `False`.
    If you want to do only entity recognition, set `intent_classification` to `False`.
    By default `DIETClassifier` does both, i.e. `entity_recognition` and `intent_classification` are set to
    `True`.

    You can define a number of hyperparameters to adapt the model.
    If you want to adapt your model, start by modifying the following parameters:

    * `epochs`:
      This parameter sets the number of times the algorithm will see the training data (default: `300`).
      One `epoch` is equal to one forward pass and one backward pass of all the training examples.
      Sometimes the model needs more epochs to properly learn.
      Sometimes more epochs don't influence the performance.
      The lower the number of epochs the faster the model is trained.

    * `hidden_layers_sizes`:
      This parameter allows you to define the number of feed forward layers and their output
      dimensions for user messages and intents (default: `text: [], label: []`).
      Every entry in the list corresponds to a feed forward layer.
      For example, if you set `text: [256, 128]`, we will add two feed forward layers in front of
      the transformer. The vectors of the input tokens (coming from the user message) will be passed on to those
      layers. The first layer will have an output dimension of 256 and the second layer will have an output
      dimension of 128. If an empty list is used (default behavior), no feed forward layer will be
      added.
      Make sure to use only positive integer values. Usually, powers of two are used.
      Also, it is common practice to have decreasing values in the list: each value is smaller than or equal to the
      value before.

    * `embedding_dimension`:
      This parameter defines the output dimension of the embedding layers used inside the model (default: `20`).
      We are using multiple embeddings layers inside the model architecture.
      For example, the vector of the complete utterance and the intent is passed on to an embedding layer before
      they are compared and the loss is calculated.

    * `number_of_transformer_layers`:
      This parameter sets the number of transformer layers to use (default: `2`).
      The number of transformer layers corresponds to the transformer blocks to use for the model.

    * `transformer_size`:
      This parameter sets the number of units in the transformer (default: `256`).
      The vectors coming out of the transformers will have the given `transformer_size`.
      The `transformer_size` should be a multiple of the `number_of_attention_heads` parameter;
      otherwise, training exits with an error.

    * `connection_density`:
      This parameter defines the fraction of kernel weights that are set to non-zero values for all feed forward
      layers in the model (default: `0.2`). The value should be between 0 and 1. If you set `connection_density`
      to 1, no kernel weights will be set to 0 and the layer acts as a standard feed forward layer. You should not
      set `connection_density` to 0, as this would result in all kernel weights being 0, i.e. the model would not be able
      to learn.

    * `BILOU_flag`:
      This parameter determines whether to use BILOU tagging or not. Default `True`.
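
Several of the constraints described above can be checked before training starts. The following is a minimal sketch of such a sanity check. The helper `check_diet_config` is hypothetical and not part of Rasa; it only covers the rules stated in this section and uses the defaults listed above when a key is missing.

```python
from typing import Any, Dict, List


def check_diet_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a DIETClassifier-style config."""
    problems: List[str] = []

    size = config.get("transformer_size", 256)
    heads = config.get("number_of_attention_heads", 4)
    if size % heads != 0:
        problems.append(
            f"transformer_size ({size}) is not a multiple of "
            f"number_of_attention_heads ({heads})"
        )

    density = config.get("connection_density", 0.2)
    if not 0 < density <= 1:
        problems.append(f"connection_density ({density}) must be in (0, 1]")

    for name, sizes in config.get("hidden_layers_sizes", {}).items():
        if any(not isinstance(s, int) or s <= 0 for s in sizes):
            problems.append(f"hidden_layers_sizes.{name} must hold positive integers")
        elif any(nxt > prev for prev, nxt in zip(sizes, sizes[1:])):
            # Decreasing sizes are common practice, not a hard requirement.
            problems.append(f"hidden_layers_sizes.{name} is not decreasing (advisory)")
    return problems


print(check_diet_config({"hidden_layers_sizes": {"text": [256, 128], "label": []}}))
print(check_diet_config({"transformer_size": 250, "connection_density": 0}))
```

The first configuration passes without findings; the second is flagged for both `transformer_size` and `connection_density`.
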

  The above configuration parameters are the ones you should configure to fit your model to your data.
  However, additional parameters exist that can be adapted.

  <details><summary>More configurable parameters</summary>

  ```
  +---------------------------------+------------------+--------------------------------------------------------------+
  | Parameter                       | Default Value    | Description                                                  |
  +=================================+==================+==============================================================+
  | hidden_layers_sizes             | text: []         | Hidden layer sizes for layers before the embedding layers    |
  |                                 | label: []        | for user messages and labels. The number of hidden layers is |
  |                                 |                  | equal to the length of the corresponding list.               |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | share_hidden_layers             | False            | Whether to share the hidden layer weights between user       |
  |                                 |                  | messages and labels.                                         |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | transformer_size                | 256              | Number of units in transformer.                              |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | number_of_transformer_layers    | 2                | Number of transformer layers.                                |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | number_of_attention_heads       | 4                | Number of attention heads in transformer.                    |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | use_key_relative_attention      | False            | If 'True' use key relative embeddings in attention.          |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | use_value_relative_attention    | False            | If 'True' use value relative embeddings in attention.        |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | max_relative_position           | None             | Maximum position for relative embeddings.                    |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | unidirectional_encoder          | False            | Use a unidirectional or bidirectional encoder.               |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | batch_size                      | [64, 256]        | Initial and final value for batch sizes.                     |
  |                                 |                  | Batch size will be linearly increased for each epoch.        |
  |                                 |                  | If constant `batch_size` is required, pass an int, e.g. `8`. |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | batch_strategy                  | "balanced"       | Strategy used when creating batches.                         |
  |                                 |                  | Can be either 'sequence' or 'balanced'.                      |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | epochs                          | 300              | Number of epochs to train.                                   |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | random_seed                     | None             | Set random seed to any 'int' to get reproducible results.    |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | learning_rate                   | 0.001            | Initial learning rate for the optimizer.                     |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | embedding_dimension             | 20               | Dimension size of embedding vectors.                         |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | dense_dimension                 | text: 128        | Dense dimension for sparse features to use if no dense       |
  |                                 | label: 20        | features are present.                                        |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | concat_dimension                | text: 128        | Concat dimension for sequence and sentence features.         |
  |                                 | label: 20        |                                                              |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | number_of_negative_examples     | 20               | The number of incorrect labels. The algorithm will minimize  |
  |                                 |                  | their similarity to the user input during training.          |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | similarity_type                 | "auto"           | Type of similarity measure to use, either 'auto' or 'cosine' |
  |                                 |                  | or 'inner'.                                                  |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | loss_type                       | "cross_entropy"  | The type of the loss function, either 'cross_entropy'        |
  |                                 |                  | or 'margin'. Type 'margin' is only compatible with           |
  |                                 |                  | "model_confidence=cosine",                                   |
  |                                 |                  | which is deprecated (see changelog for 2.3.4).               |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | ranking_length                  | 10               | Number of top intents to report. Set to 0 to report all      |
  |                                 |                  | intents.                                                     |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | renormalize_confidences         | False            | Normalize the reported top intents. Applicable only with     |
  |                                 |                  | loss type 'cross_entropy' and 'softmax' confidences.         |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | maximum_positive_similarity     | 0.8              | Indicates how similar the algorithm should try to make       |
  |                                 |                  | embedding vectors for correct labels.                        |
  |                                 |                  | Should be 0.0 < ... < 1.0 for 'cosine' similarity type.      |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | maximum_negative_similarity     | -0.4             | Maximum negative similarity for incorrect labels.            |
  |                                 |                  | Should be -1.0 < ... < 1.0 for 'cosine' similarity type.     |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | use_maximum_negative_similarity | True             | If 'True' the algorithm only minimizes maximum similarity    |
  |                                 |                  | over incorrect intent labels, used only if 'loss_type' is    |
  |                                 |                  | set to 'margin'.                                             |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | scale_loss                      | False            | Scale loss inverse proportionally to confidence of correct   |
  |                                 |                  | prediction.                                                  |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | regularization_constant         | 0.002            | The scale of regularization.                                 |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | negative_margin_scale           | 0.8              | The scale of how important it is to minimize the maximum     |
  |                                 |                  | similarity between embeddings of different labels.           |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | connection_density              | 0.2              | Connection density of the weights in dense layers.           |
  |                                 |                  | Value should be between 0 and 1.                             |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | drop_rate                       | 0.2              | Dropout rate for encoder. Value should be between 0 and 1.   |
  |                                 |                  | The higher the value the higher the regularization effect.   |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | drop_rate_attention             | 0.0              | Dropout rate for attention. Value should be between 0 and 1. |
  |                                 |                  | The higher the value the higher the regularization effect.   |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | use_sparse_input_dropout        | True             | If 'True' apply dropout to sparse input tensors.             |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | use_dense_input_dropout         | True             | If 'True' apply dropout to dense input tensors.              |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | evaluate_every_number_of_epochs | 20               | How often to calculate validation accuracy.                  |
  |                                 |                  | Set to '-1' to evaluate just once at the end of training.    |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | evaluate_on_number_of_examples  | 0                | How many examples to use for hold out validation set.        |
  |                                 |                  | Large values may hurt performance, e.g. model accuracy.      |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | intent_classification           | True             | If 'True' intent classification is trained and intents are   |
  |                                 |                  | predicted.                                                   |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | entity_recognition              | True             | If 'True' entity recognition is trained and entities are     |
  |                                 |                  | extracted.                                                   |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | use_masked_language_model       | False            | If 'True' random tokens of the input message will be masked  |
  |                                 |                  | and the model has to predict those tokens. It acts like a    |
  |                                 |                  | regularizer and should help to learn a better contextual     |
  |                                 |                  | representation of the input.                                 |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | BILOU_flag                      | True             | If 'True', additional BILOU tags are added to entity labels. |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | tensorboard_log_directory       | None             | If you want to use tensorboard to visualize training         |
  |                                 |                  | metrics, set this option to a valid output directory. You    |
  |                                 |                  | can view the training metrics after training in tensorboard  |
  |                                 |                  | via 'tensorboard --logdir <path-to-given-directory>'.        |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | tensorboard_log_level           | "epoch"          | Define when training metrics for tensorboard should be       |
  |                                 |                  | logged. Either after every epoch ('epoch') or for every      |
  |                                 |                  | training step ('batch').                                     |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | featurizers                     | []               | List of featurizer names (alias names). Only features        |
  |                                 |                  | coming from the listed names are used. If list is empty      |
  |                                 |                  | all available features are used.                             |
  +---------------------------------+------------------+--------------------------------------------------------------+
  | checkpoint_model                | False            | Save the best performing model during training. Models are   |
  |                                 |                  | stored to the location specified by `--out`. Only the one    |
  |                                 |                  | best model will be saved.                                    |
  |                                 |                  | Requires `evaluate_on_number_of_examples > 0` and            |
  |                                 |                  | `evaluate_every_number_of_epochs > 0`                        |
  +---------------------------------+------------------+--------------------------------------------------------------+
  ```

  :::note
  Parameter `maximum_negative_similarity` is set to a negative value to mimic the original
  StarSpace algorithm in the case `maximum_negative_similarity = maximum_positive_similarity`
  and `use_maximum_negative_similarity = False`.
  See the [StarSpace paper](https://arxiv.org/abs/1709.03856) for details.

  :::

  </details>
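
  As an illustration of how a few of the parameters in the table above relate to each other, the fragment below
  sketches a `DIETClassifier` entry that saves the best model during training and logs metrics for TensorBoard.
  The number of validation examples, the evaluation interval, and the `./tensorboard` directory are
  assumptions chosen for the example, not recommended values.

  ```yaml-rasa
  pipeline:
  # ... tokenizer and featurizers go here ...
  - name: DIETClassifier
    epochs: 300
    # hold out some examples so that validation metrics can be computed
    evaluate_on_number_of_examples: 200   # assumed value
    evaluate_every_number_of_epochs: 5    # assumed value
    # keep only the best performing model; requires both settings above to be > 0
    checkpoint_model: true
    # write training metrics after every epoch
    tensorboard_log_directory: "./tensorboard"  # assumed path
    tensorboard_log_level: "epoch"
  ```

  After training, the logged metrics can be inspected with `tensorboard --logdir ./tensorboard`.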

## Selectors

Selectors predict a bot response from a set of candidate responses.


### ResponseSelector


* **Short**

  Response Selector



* **Outputs**

  A dictionary whose keys are the retrieval intents of the response selectors and whose values
  contain the predicted responses, the confidence, and the response key under the retrieval intent



* **Requires**

  `dense_features` and/or `sparse_features` for user messages and responses



*   **Output-Example**

    The parsed output from NLU will have a property named `response_selector`
    containing the output for each response selector component. Each response selector is
    identified by its `retrieval_intent` parameter
    and stores two properties:

    * `response`: The predicted response key under the corresponding retrieval intent,
      the prediction's confidence, and the associated responses.

    * `ranking`: Ranking with confidences of the top candidate response keys
      (10 by default, see `ranking_length`).

    Example result:

    ```json
    {
        "response_selector": {
          "faq": {
            "response": {
              "id": 1388783286124361986,
              "confidence": 0.7,
              "intent_response_key": "chitchat/ask_weather",
              "responses": [
                {
                  "text": "It's sunny in Berlin today",
                  "image": "https://i.imgur.com/nGF1K8f.jpg"
                },
                {
                  "text": "I think it's about to rain."
                }
              ],
              "utter_action": "utter_chitchat/ask_weather"
            },
            "ranking": [
              {
                "id": 1388783286124361986,
                "confidence": 0.7,
                "intent_response_key": "chitchat/ask_weather"
              },
              {
                "id": 1388783286124361986,
                "confidence": 0.3,
                "intent_response_key": "chitchat/ask_name"
              }
            ]
          }
        }
    }
    ```

    If the `retrieval_intent` parameter of a particular response selector is left at its default value,
    that response selector is identified as `default` in the returned output.

    ```json {3}
    {
        "response_selector": {
          "default": {
            "response": {
              "id": 1388783286124361986,
              "confidence": 0.7,
              "intent_response_key": "chitchat/ask_weather",
              "responses": [
                {
                  "text": "It's sunny in Berlin today",
                  "image": "https://i.imgur.com/nGF1K8f.jpg"
                },
                {
                  "text": "I think it's about to rain."
                }
              ],
              "utter_action": "utter_chitchat/ask_weather"
            },
            "ranking": [
              {
                "id": 1388783286124361986,
                "confidence": 0.7,
                "intent_response_key": "chitchat/ask_weather"
              },
              {
                "id": 1388783286124361986,
                "confidence": 0.3,
                "intent_response_key": "chitchat/ask_name"
              }
            ]
          }
        }
    }
    ```

* **Description**

  The `ResponseSelector` component can be used to build a response retrieval model that directly predicts a bot response from
  a set of candidate responses. The dialogue manager uses the prediction of this model to utter the predicted responses.
  It embeds user inputs and response labels into the same space and follows the same
  neural network architecture and optimization as the [DIETClassifier](./components.mdx#dietclassifier).

  To use this component, your training data should contain [retrieval intents](./glossary.mdx#retrieval-intent). To define these,
  check out the [documentation on NLU training examples](./training-data-format.mdx#training-examples) and the
  [documentation on defining response utterances for retrieval intents](./responses.mdx#defining-responses).

  :::note
  If, at prediction time, a message contains **only** words unseen during training
  and no out-of-vocabulary preprocessor was used, an empty response `None` is predicted with confidence
  `0.0`. This might happen if you use only the [CountVectorsFeaturizer](./components.mdx#countvectorsfeaturizer) with a `word` analyzer
  as featurizer. If you use the `char_wb` analyzer, you should always get a response with a confidence
  value `> 0.0`.

  :::
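
  One way to act on the note above is to add a character n-gram featurizer next to the word-level one.
  The fragment below is a minimal sketch of such a pipeline; the tokenizer choice, the n-gram range, and the
  epoch count are illustrative assumptions rather than recommended settings.

  ```yaml-rasa
  pipeline:
  - name: WhitespaceTokenizer
  # word-level features: a message made up only of unseen words yields no features here
  - name: CountVectorsFeaturizer
    analyzer: word
  # character n-grams inside word boundaries can still match parts of unseen words
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1   # assumed value
    max_ngram: 4   # assumed value
  - name: ResponseSelector
    epochs: 100    # assumed value
  ```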



* **Configuration**

  The algorithm includes almost all the hyperparameters that [DIETClassifier](./components.mdx#dietclassifier) uses.
  If you want to adapt your model, start by modifying the following parameters:

  * `epochs`:
    This parameter sets the number of times the algorithm will see the training data (default: `300`).
    One epoch equals one forward pass and one backward pass over all the training examples.
    Sometimes the model needs more epochs to learn properly,
    and sometimes more epochs don't improve the performance.
    The lower the number of epochs, the faster the model is trained.

  * `hidden_layers_sizes`:
    This parameter allows you to define the number of feed forward layers and their output
    dimensions for user messages and labels (default: `text: [256, 128], label: [256, 128]`).
    Every entry in the list corresponds to a feed forward layer.
    For example, if you set `text: [256, 128]`, we will add two feed forward layers in front of
    the transformer. The vectors of the input tokens (coming from the user message) will be passed on to those
    layers. The first layer will have an output dimension of 256 and the second layer will have an output
    dimension of 128. If an empty list is used, no feed forward layer will be
    added.
    Make sure to use only positive integer values. Usually, powers of two are used.
    Also, it is common practice to use decreasing values in the list: each value is smaller than or equal to the
    value before it.

  * `embedding_dimension`:
    This parameter defines the output dimension of the embedding layers used inside the model (default: `20`).
    We are using multiple embedding layers inside the model architecture.
    For example, the vectors of the complete utterance and of the response label are passed on to an embedding layer before
    they are compared and the loss is calculated.

  * `number_of_transformer_layers`:
    This parameter sets the number of transformer layers to use (default: `0`).
    The number of transformer layers corresponds to the transformer blocks to use for the model.

  * `transformer_size`:
    This parameter sets the number of units in the transformer (default: `None`).
    The vectors coming out of the transformers will have the given `transformer_size`.
    The `transformer_size` should be a multiple of the `number_of_attention_heads` parameter;
    otherwise the training exits with an error.

  * `connection_density`:
    This parameter defines the fraction of kernel weights that are set to non-zero values for all feed forward
    layers in the model (default: `0.2`). The value should be between 0 and 1. If you set `connection_density`
    to 1, no kernel weights will be set to 0 and the layer acts as a standard feed forward layer. You should not
    set `connection_density` to 0, as this would result in all kernel weights being 0, i.e. the model would not be able
    to learn.

  * `constrain_similarities`:
    When set to `True`, this parameter applies a sigmoid cross entropy loss over all similarity terms.
    This helps keep the similarities between the input and negative labels small,
    which should help the model generalize better to real world test sets.

  * `model_confidence`:
    This parameter allows the user to configure how confidences are computed during inference. It accepts only
    one value, `softmax`. With `softmax`, confidences are in the range `[0, 1]`: the computed
    similarities are normalized with the `softmax` activation function.
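
  Put together, a configuration that adjusts the parameters described above could look like the following sketch.
  All values are illustrative assumptions, not tuned recommendations. The hidden layers are set to empty lists
  because the transformer is enabled, which is what the parameter table for this component recommends, and
  `transformer_size` is chosen as a multiple of `number_of_attention_heads`.

  ```yaml-rasa
  pipeline:
  # ... tokenizer and featurizers go here ...
  - name: ResponseSelector
    epochs: 100                      # fewer epochs than the default of 300
    hidden_layers_sizes:             # no feed forward layers in front of the transformer
      text: []
      label: []
    embedding_dimension: 20
    number_of_transformer_layers: 2  # any positive value enables the transformer
    transformer_size: 256            # 256 is a multiple of the 4 attention heads
    number_of_attention_heads: 4
    connection_density: 0.2
    constrain_similarities: true
    model_confidence: softmax
  ```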

  The component can also be configured to train a response selector for a particular retrieval intent.
  The parameter `retrieval_intent` sets the name of the retrieval intent for which this response selector model is trained.
  Default is `None`, i.e. the model is trained for all retrieval intents.
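
  As a sketch, the fragment below trains one response selector per retrieval intent. The retrieval intent names
  `faq` and `chitchat` and the epoch counts are assumptions for illustration; use the retrieval intents
  defined in your own training data.

  ```yaml-rasa
  pipeline:
  # ... tokenizer, featurizers and intent classifier go here ...
  # handles only examples of the (assumed) retrieval intent `faq`
  - name: ResponseSelector
    retrieval_intent: faq
    epochs: 100
  # handles only examples of the (assumed) retrieval intent `chitchat`
  - name: ResponseSelector
    retrieval_intent: chitchat
    epochs: 100
  ```

  With this setup, the parsed output contains one entry per selector under `response_selector`, keyed by
  `faq` and `chitchat` instead of `default`.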

  In its default configuration, the component uses the retrieval intent with the response key (e.g. `faq/ask_name`) as the label for training.
  Alternatively, it can also be configured to use the text of the responses as the training label
  by switching `use_text_as_label` to `True`. In this mode, the component will use the first available response which has a text attribute for training. If none are found, it falls back to using the retrieval intent
  combined with the response key as the label.
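
  A minimal sketch of switching the training label to the response text is shown below; the rest of the
  pipeline is omitted and the epoch count is an assumed value.

  ```yaml-rasa
  pipeline:
  # ... tokenizer and featurizers go here ...
  - name: ResponseSelector
    # train on the text of the responses instead of labels such as `faq/ask_name`
    use_text_as_label: true
    epochs: 100   # assumed value
  ```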

  :::note examples and tutorials
  Check out the [responseselectorbot](https://github.com/RasaHQ/rasa/tree/main/examples/responseselectorbot) for an example of
  how you can use the `ResponseSelector` component in your assistant. The tutorial on
  [handling FAQs](./chitchat-faqs.mdx) with a `ResponseSelector` is also a useful starting point.
  :::


The above configuration parameters are the ones you should configure to fit your model to your data.
However, additional parameters exist that can be adapted.

<details><summary>More configurable parameters</summary>

```text
+---------------------------------+-------------------+--------------------------------------------------------------+
| Parameter                       | Default Value     | Description                                                  |
+=================================+===================+==============================================================+
| hidden_layers_sizes             | text: [256, 128]  | Hidden layer sizes for layers before the embedding layers    |
|                                 | label: [256, 128] | for user messages and labels. The number of hidden layers is |
|                                 |                   | equal to the length of the corresponding list. We recommend  |
|                                 |                   | disabling the hidden layers (by providing empty lists) when  |
|                                 |                   | the transformer is enabled.                                  |
+---------------------------------+-------------------+--------------------------------------------------------------+
| share_hidden_layers             | False             | Whether to share the hidden layer weights between user       |
|                                 |                   | messages and labels.                                         |
+---------------------------------+-------------------+--------------------------------------------------------------+
| transformer_size                | None              | Number of units in the transformer. When a positive value is |
|                                 |                   | provided for `number_of_transformer_layers`, the default     |
|                                 |                   | size becomes `256`.                                          |
+---------------------------------+-------------------+--------------------------------------------------------------+
| number_of_transformer_layers    | 0                 | Number of transformer layers; positive values enable the     |
|                                 |                   | transformer.                                                 |
+---------------------------------+-------------------+--------------------------------------------------------------+
| number_of_attention_heads       | 4                 | Number of attention heads in transformer.                    |
+---------------------------------+-------------------+--------------------------------------------------------------+
| use_key_relative_attention      | False             | If 'True' use key relative embeddings in attention.          |
+---------------------------------+-------------------+--------------------------------------------------------------+
| use_value_relative_attention    | False             | If 'True' use value relative embeddings in attention.        |
+---------------------------------+-------------------+--------------------------------------------------------------+
| max_relative_position           | None              | Maximum position for relative embeddings.                    |
+---------------------------------+-------------------+--------------------------------------------------------------+
| unidirectional_encoder          | False             | Use a unidirectional or bidirectional encoder.               |
+---------------------------------+-------------------+--------------------------------------------------------------+
| batch_size                      | [64, 256]         | Initial and final value for batch sizes.                     |
|                                 |                   | Batch size will be linearly increased for each epoch.        |
|                                 |                   | If constant `batch_size` is required, pass an int, e.g. `8`. |
+---------------------------------+-------------------+--------------------------------------------------------------+
| batch_strategy                  | "balanced"        | Strategy used when creating batches.                         |
|                                 |                   | Can be either 'sequence' or 'balanced'.                      |
+---------------------------------+-------------------+--------------------------------------------------------------+
| epochs                          | 300               | Number of epochs to train.                                   |
+---------------------------------+-------------------+--------------------------------------------------------------+
| random_seed                     | None              | Set random seed to any 'int' to get reproducible results.    |
+---------------------------------+-------------------+--------------------------------------------------------------+
| learning_rate                   | 0.001             | Initial learning rate for the optimizer.                     |
+---------------------------------+-------------------+--------------------------------------------------------------+
| embedding_dimension             | 20                | Dimension size of embedding vectors.                         |
+---------------------------------+-------------------+--------------------------------------------------------------+
| dense_dimension                 | text: 512         | Dense dimension for sparse features to use if no dense       |
|                                 | label: 512        | features are present.                                        |
+---------------------------------+-------------------+--------------------------------------------------------------+
| concat_dimension                | text: 512         | Concat dimension for sequence and sentence features.         |
|                                 | label: 512        |                                                              |
+---------------------------------+-------------------+--------------------------------------------------------------+
| number_of_negative_examples     | 20                | The number of incorrect labels. The algorithm will minimize  |
|                                 |                   | their similarity to the user input during training.          |
+---------------------------------+-------------------+--------------------------------------------------------------+
| similarity_type                 | "auto"            | Type of similarity measure to use, either 'auto' or 'cosine' |
|                                 |                   | or 'inner'.                                                  |
+---------------------------------+-------------------+--------------------------------------------------------------+
| loss_type                       | "cross_entropy"   | The type of the loss function, either 'cross_entropy'        |
|                                 |                   | or 'margin'. Type 'margin' is only compatible with           |
|                                 |                   | "model_confidence=cosine",                                   |
|                                 |                   | which is deprecated (see changelog for 2.3.4).               |
+---------------------------------+-------------------+--------------------------------------------------------------+
| ranking_length                  | 10                | Number of top responses to report. Set to 0 to report all    |
|                                 |                   | responses.                                                   |
+---------------------------------+-------------------+--------------------------------------------------------------+
| renormalize_confidences         | False             | Normalize the top responses. Applicable only with loss type  |
|                                 |                   | 'cross_entropy' and 'softmax' confidences.                   |
+---------------------------------+-------------------+--------------------------------------------------------------+
| maximum_positive_similarity     | 0.8               | Indicates how similar the algorithm should try to make       |
|                                 |                   | embedding vectors for correct labels.                        |
|                                 |                   | Should be 0.0 < ... < 1.0 for 'cosine' similarity type.      |
+---------------------------------+-------------------+--------------------------------------------------------------+
| maximum_negative_similarity     | -0.4              | Maximum negative similarity for incorrect labels.            |
|                                 |                   | Should be -1.0 < ... < 1.0 for 'cosine' similarity type.     |
+---------------------------------+-------------------+--------------------------------------------------------------+
| use_maximum_negative_similarity | True              | If 'True' the algorithm only minimizes maximum similarity    |
|                                 |                   | over incorrect intent labels, used only if 'loss_type' is    |
|                                 |                   | set to 'margin'.                                             |
+---------------------------------+-------------------+--------------------------------------------------------------+
| scale_loss                      | True              | Scale loss inverse proportionally to confidence of correct   |
|                                 |                   | prediction.                                                  |
+---------------------------------+-------------------+--------------------------------------------------------------+
| regularization_constant         | 0.002             | The scale of regularization.                                 |
+---------------------------------+-------------------+--------------------------------------------------------------+
| negative_margin_scale           | 0.8               | The scale of how important it is to minimize the maximum     |
|                                 |                   | similarity between embeddings of different labels.           |
+---------------------------------+-------------------+--------------------------------------------------------------+
| connection_density              | 0.2               | Connection density of the weights in dense layers.           |
|                                 |                   | Value should be between 0 and 1.                             |
+---------------------------------+-------------------+--------------------------------------------------------------+
| drop_rate                       | 0.2               | Dropout rate for encoder. Value should be between 0 and 1.   |
|                                 |                   | The higher the value the higher the regularization effect.   |
+---------------------------------+-------------------+--------------------------------------------------------------+
| drop_rate_attention             | 0.0               | Dropout rate for attention. Value should be between 0 and 1. |
|                                 |                   | The higher the value the higher the regularization effect.   |
+---------------------------------+-------------------+--------------------------------------------------------------+
| use_sparse_input_dropout        | False             | If 'True' apply dropout to sparse input tensors.             |
+---------------------------------+-------------------+--------------------------------------------------------------+
| use_dense_input_dropout         | False             | If 'True' apply dropout to dense input tensors.              |
+---------------------------------+-------------------+--------------------------------------------------------------+
| evaluate_every_number_of_epochs | 20                | How often to calculate validation accuracy.                  |
|                                 |                   | Set to '-1' to evaluate just once at the end of training.    |
+---------------------------------+-------------------+--------------------------------------------------------------+
| evaluate_on_number_of_examples  | 0                 | How many examples to use for hold out validation set.        |
|                                 |                   | Large values may hurt performance, e.g. model accuracy.      |
|                                 |                   | Set to 0 for no validation.                                  |
+---------------------------------+-------------------+--------------------------------------------------------------+
| use_masked_language_model       | False             | If 'True' random tokens of the input message will be masked  |
|                                 |                   | and the model should predict those tokens.                   |
+---------------------------------+-------------------+--------------------------------------------------------------+
| retrieval_intent                | None              | Name of the intent for which this response selector model is |
|                                 |                   | trained.                                                     |
+---------------------------------+-------------------+--------------------------------------------------------------+
| use_text_as_label               | False             | Whether to use the actual text of the response as the label  |
|                                 |                   | for training the response selector. Otherwise, it uses the   |
|                                 |                   | response key as the label.                                   |
+---------------------------------+-------------------+--------------------------------------------------------------+
| tensorboard_log_directory       | None              | If you want to use tensorboard to visualize training         |
|                                 |                   | metrics, set this option to a valid output directory. You    |
|                                 |                   | can view the training metrics after training in tensorboard  |
|                                 |                   | via 'tensorboard --logdir <path-to-given-directory>'.        |
+---------------------------------+-------------------+--------------------------------------------------------------+
| tensorboard_log_level           | "epoch"           | Define when training metrics for tensorboard should be       |
|                                 |                   | logged. Either after every epoch ("epoch") or for every      |
|                                 |                   | training step ("batch").                                     |
+---------------------------------+-------------------+--------------------------------------------------------------+
| featurizers                     | []                | List of featurizer names (alias names). Only features        |
|                                 |                   | coming from the listed names are used. If list is empty      |
|                                 |                   | all available features are used.                             |
+---------------------------------+-------------------+--------------------------------------------------------------+
| checkpoint_model                | False             | Save the best performing model during training. Models are   |
|                                 |                   | stored to the location specified by `--out`. Only the one    |
|                                 |                   | best model will be saved.                                    |
|                                 |                   | Requires `evaluate_on_number_of_examples > 0` and            |
|                                 |                   | `evaluate_every_number_of_epochs > 0`.                       |
+---------------------------------+-------------------+--------------------------------------------------------------+
| constrain_similarities          | False             | If `True`, applies sigmoid on all similarity terms and adds  |
|                                 |                   | it to the loss function to ensure that similarity values are |
|                                 |                   | approximately bounded. Used only if `loss_type=cross_entropy`|
+---------------------------------+-------------------+--------------------------------------------------------------+
| model_confidence                | "softmax"         | Affects how the model's confidence for each response label   |
|                                 |                   | is computed. Currently, only one value is supported:         |
|                                 |                   | 1. `softmax` - Similarities between input and response label |
|                                 |                   | embeddings are post-processed with a softmax function,       |
|                                 |                   | so that the confidences for all labels sum to 1.             |
+---------------------------------+-------------------+--------------------------------------------------------------+
```
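
As a rough illustration of how some of the options above fit together, the following sketch enables model checkpointing, which per the table requires both `evaluate_on_number_of_examples` and `evaluate_every_number_of_epochs` to be greater than 0, and logs training metrics for TensorBoard. The retrieval intent name `faq`, the numeric values, and the log directory are assumptions chosen for illustration, not recommendations:

```yaml-rasa
pipeline:
# tokenizer and featurizers omitted for brevity
- name: "ResponseSelector"
  # hypothetical retrieval intent name
  retrieval_intent: "faq"
  # hold out some examples and evaluate periodically (illustrative values)
  evaluate_on_number_of_examples: 50
  evaluate_every_number_of_epochs: 5
  # keep only the best performing model seen during training
  checkpoint_model: True
  # hypothetical output directory for TensorBoard logs
  tensorboard_log_directory: "./tensorboard"
  tensorboard_log_level: "epoch"
```

With this directory, the metrics could be viewed after training via `tensorboard --logdir ./tensorboard`.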

:::note
The parameter `maximum_negative_similarity` is set to a negative value to mimic the original
StarSpace algorithm in the case `maximum_negative_similarity = maximum_positive_similarity`
and `use_maximum_negative_similarity = False`.
See the [StarSpace paper](https://arxiv.org/abs/1709.03856) for details.

:::
  </details>


## Custom Components

:::info New in 3.0
Rasa 3.0 unified the implementation of NLU components and policies.
This requires changes to custom components written for earlier versions of Rasa Open
Source. See the
[migration guide](migration-guide.mdx#custom-policies-and-custom-components) for
step-by-step instructions.

:::

You can create a custom component to perform a specific task that the NLU pipeline doesn't currently offer (for example, sentiment analysis).

To add a custom component to your pipeline, reference it by its module path.
For example, if you have a module called `sentiment`
containing a `SentimentAnalyzer` class:

```yaml-rasa
pipeline:
- name: "sentiment.SentimentAnalyzer"
```

See the [guide on custom graph components](custom-graph-components.mdx) for complete instructions on writing custom components.
Also be sure to read the section on the [Component Lifecycle](./tuning-your-model.mdx#component-lifecycle).
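
Because components work sequentially, a custom component is placed in the pipeline like any other entry and runs after the components listed before it. The sketch below assumes, hypothetically, that `SentimentAnalyzer` operates on tokens and accepts a `threshold` option. Neither assumption comes from Rasa itself, and the option name is purely illustrative:

```yaml-rasa
pipeline:
# tokenizer listed first so that tokens are available to later components
- name: "WhitespaceTokenizer"
- name: "sentiment.SentimentAnalyzer"
  # hypothetical option; its name and meaning depend on your own class
  threshold: 0.5
```

As with built-in components such as `MitieNLP` and its `model` key, keys other than `name` are intended as configuration for that component. How your class reads them is covered in the custom graph components guide.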

